Visualise column detection results

I ran my column detection script across all 199 volumes and 70,000+ pages of the Sydney Stock Exchange registers, generating a CSV file for each year from 1901 to 1950. See this notebook for some more details

This notebook combines the separate CSV files into a single dataframe and then visualises the column detection results using Altair.

In [1]:
import pandas as pd
import os
import altair as alt
from IPython.display import display, HTML

# Comment the next line out if using in JupyterLab

In [2]:
# We're going to combine all of the CSV files into one big dataframe

# Create an empty dataframe
combined_df = pd.DataFrame()

# Loop through the range of years
for year in range(1901, 1951):
    # Open the CSV file for that year as a dataframe
    year_df = pd.read_csv('{}.csv'.format(year))
    # Add the single year df to the combined df
    combined_df = combined_df.append(year_df)

To visualise the results, we first need to move the data around a bit to get it into a form we can chart using Altair.

Because this loops through dataframes for each volume calculating column widths, it is a bit slow...

In [3]:
# Replace any NaNs with 0
combined_df['column_positions'].fillna('0', inplace=True)

# Group images by volume using the 'referenceCode' column
vol_groups = combined_df.groupby('referenceCode')

# Create an empty dataframe
df_volumes = pd.DataFrame()

# Just used to limit the number of volumes processed for testing
max_rows = 200
row = 0

# Loop through the volume groups
for vol, pages in vol_groups:
    if row < max_rows:
        # Pages is a dataframe with details of all the pages in a volume
        pages = pages.copy()
        # Convert the width to a string, so we concatenate with the column positions
        pages['width'] = pages['width'].astype('str')
        # Add the page width to the end of the column positions string
        pages['column_positions'] = pages[['column_positions', 'width']].apply(lambda x: ','.join(x), axis=1)
        # Create a new dataframe by exploding the column positions into separate rows, using the file name as the index
        new_df = pd.DataFrame(pages['column_positions'].str.split(',').tolist(), index=pages['name']).stack()
        # Change the file name from an index into a column
        new_df = new_df.reset_index([0, 'name'])
        # Rename the columns
        new_df.columns = ['name', 'col_pos']
        # Convert col_pos column to an integer
        new_df['col_pos'] = new_df['col_pos'].astype('int')
        # Extract the page number from the file name and save as a new field
        new_df['page_num'] = new_df['name'].str.extract(r'_(\d+)\.').astype('int')
         # Add the volume name
        new_df['volume'] = vol
        # Add the volume number
        new_df['volume_num'] = int(vol[-3:])
        # Sort pages by the volume number / page number
        new_df = new_df.sort_values(by=['volume_num','page_num'])
        # Number the columns in each row
        new_df['order'] = new_df.groupby('page_num').cumcount()
        # In this loop we're going to calculate the width of each column by subtracting
        # the previous col_pos from the current one
        # Loop through the dataframe
        for i in range(1, len(new_df)):
            # if it's the first column then the width is equal to the position
            if new_df.loc[i, 'order'] == 0:
                new_df.loc[i, 'col_width'] = new_df.loc[i, 'col_pos']
            # Otherwise we have to get the previous value and subtract
                new_df.loc[i, 'col_width'] = new_df.loc[i, 'col_pos'] - new_df.loc[i-1, 'col_pos']
        # Append to the combined volumes dataframe
        df_volumes = df_volumes.append(new_df)
    row += 1

Let's have a peek inside the new dataframe.

In [4]:
name col_pos page_num volume volume_num order col_width
0 N193-001_0001.tif 0 1 N193-001 1 0 NaN
1 N193-001_0001.tif 1811 1 N193-001 1 1 1811.0
2 N193-001_0001.tif 3222 1 N193-001 1 2 1411.0
3 N193-001_0001.tif 6237 1 N193-001 1 3 3015.0
4 N193-001_0002.tif 205 2 N193-001 1 0 205.0

Chart a single volume

Now we have a dataframe that has one row for every column on every page in every volume. Let's visualise a single volume by creating a bar chart that stacks the column widths on top of each other to represent a page. In other words, there'll be one column in the chart for every page in the volume, and the total height of the column will represent the width of the digitised image. The widths of individual table columns will be shown by different coloured segments.

We'll also add a couple of extra features. When you hover over the chart, a tooltip will display the volume, page, and column details. And when you click, you'll download the image of that page from CloudStor (you'll need to be logged in to CloudStor for this to work).

In [5]:
def chart_volume(df):
    chart = alt.Chart(df).transform_calculate(
        # Create a url that will download a page image when clicked
        url='' + alt.datum.volume + '/' +
    # X axis shows pages
    x=alt.X('page_num:O', axis=alt.Axis(labels=False, ticks=False), title='', scale=alt.Scale(rangeStep=7)),
    # Y axis stacks the table column widths together to show total page width (in pixels)
    y=alt.Y('col_width:Q', axis=alt.Axis(labels=False, ticks=False), title=''),
    # Individual table column widths are distinguished by color
    color=alt.Color('order:N', legend=alt.Legend(title='Column')),
    # Hover to see page and column details
    tooltip=['volume', 'page_num', 'col_width'],
    # On click, download the image
    # Order the coloured segments by column order
    order=alt.Order('order', sort='ascending')
    return chart

Create a chart! Change the volume value as desired...

In [6]:
volume = 'N193-001'
chart = chart_volume(df_volumes.loc[df_volumes['volume'] == volume])

Chart all the volumes

We can visualise all 199 volumes much the same way. To make the volumes a bit easier to read, we'll display them vertically. We'll also use Altair's 'facet' option to show lots of different charts at once, as it's more efficient that generating each chart individually. However, after a bit of trial and error, it seems there's a limit on the number of facets you can display. We'll deal with this by dividing the dataframe roughly into halves, and visualising each separately.

You can change the number of charts in each row by changing the num_columns value.

In [7]:
# Much the same as above, but note the use of facet to generate multiple charts
def chart_volumes(df, num_columns=25):
    chart = alt.Chart(df).mark_bar(size=3).encode(
    y=alt.X('page_num:O', axis=alt.Axis(labels=False, ticks=False), title='', scale=alt.Scale(rangeStep=3.4)),
    x=alt.Y('col_width:Q', axis=alt.Axis(labels=False, ticks=False), title=''),
    color=alt.Color('order:N', legend=None),
    tooltip=['volume', 'page_num', 'col_width'],
    order=alt.Order('order', sort='ascending')
        # Set the width
        # This creates a separate chart for each volume
        # Number of charts in a row
    return chart
In [8]:
# Display the first 100 volumes
chart1 = chart_volumes(df_volumes.loc[df_volumes['volume_num'] <= 100], num_columns=20)

# Display the chart

# Save the chart as html'docs/chart-vols-1-100.html')