At the end of the Fusus workflow the OCRed Afifi edition is produced as a .tsv file.
Cornelis has imported this file into Pandas, cleaned it, and saved it as a .csv file.
In that process, the punctuation after words has gone missing.
We reinsert it, based on the .tsv file.
However, the number of rows in the two files is not equal: the cleaned file has ca. 8000 rows fewer.
So we have to look carefully at which punctuation we are going to reinsert.
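To make the matching concrete, here is a tiny sketch (with made-up field values) of how the same word box can be identified in both files. The field names come from the key description below; the concrete values are hypothetical.

```python
# Hypothetical rows: the first 8 fields (page, stripe, column, line,
# left, top, right, bottom) identify a word box; the original row
# carries a trailing punctuation field that the cleaned row has lost.
origRow = ["47", "1", "1", "5", "100", "200", "300", "220", "0.99", "word", "."]
cleanRow = ["47", "1", "1", "5", "100", "200", "300", "220", "0.99", "word"]

origKey = ",".join(origRow[0:8])
cleanKey = ",".join(cleanRow[0:8])

assert origKey == cleanKey  # same word box in both files
punc = origRow[10]          # the punctuation we want to carry over
```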
We should not insert punctuation of the form ( and ).

We join the fields that identify a row with a , and call that the key of the row.
Per key we note added if we have added the punctuation, or ignored if we have ignored it.

Empty values show up as 0 in the cleaned file.
We restore those zeroes to empty strings in the resulting tweaked file.

import os
import collections
BASE_DIR = os.path.expanduser("~/github/among/fusus")
CLEAN_FILE = f"{BASE_DIR}/fusust-text-laboratory/AfifiCleaned.csv"
ORIG_FILE = f"{BASE_DIR}/ur/Afifi/allpages.tsv"
TWEAK_FILE = f"{BASE_DIR}/fusust-text-laboratory/AfifiTweaked.csv"
CLEAN_MISSING = f"{BASE_DIR}/fusust-text-laboratory/AfifiDeletedRows.tsv"
ORIG_MISSING = f"{BASE_DIR}/fusust-text-laboratory/AfifiNotFoundRows.tsv"
PUNC_ADDED = f"{BASE_DIR}/fusust-text-laboratory/AfifiAddedPunc.tsv"
We make an index whose keys are the combination of the page, stripe, column, line, left, top, right, and bottom fields, and whose values are tuples of the remaining fields.
We detect when multiple rows have the same key.
We do this for both the original and the cleaned file.
Note that the two files have different field separators, so we pass the separator as a parameter.
We pass correct=2 to replace zeroes in column 2 by empty strings; we need that for the cleaned file.
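A minimal illustration (with a hypothetical sample row) of what correct=2 does: a literal 0 in column 2 stands for an empty value and is blanked out, so that keys from the two files can match again.

```python
# Hypothetical cleaned-file row: column 2 contains "0" where the
# original value was empty; correct=2 restores the empty string.
fields = "47,1,0,5,100,200,300,220,0.99,word".split(",")
correct = 2
if fields[correct] == "0":
    fields[correct] = ""

assert fields[2] == ""
```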
def makeIndex(path, label, sep, correct=None):
    rowIndex = {}
    duplicateKeys = {}

    with open(path) as fh:
        next(fh)
        for line in fh:
            fields = line.rstrip("\n").split(sep)
            if correct is not None:
                if fields[correct] == "0":
                    fields[correct] = ""
            key = ",".join(fields[0:8])
            value = tuple(fields[8:])
            if key in rowIndex:
                if key in duplicateKeys:
                    duplicateKeys[key].append(value)
                else:
                    duplicateKeys[key] = [rowIndex[key], value]
            rowIndex[key] = value

    print(f"INFO: {label}: There are {len(rowIndex)} keys")
    if duplicateKeys:
        print(f"WARNING: {label}: There are {len(duplicateKeys)} keys with multiple rows")
    else:
        print(f"OK: {label}: No keys with multiple rows")
    return rowIndex
origRowIndex = makeIndex(ORIG_FILE, "original file", "\t")
cleanRowIndex = makeIndex(CLEAN_FILE, "cleaned file", ",", correct=2)
INFO: original file: There are 48871 keys
OK: original file: No keys with multiple rows
INFO: cleaned file: There are 40271 keys
OK: cleaned file: No keys with multiple rows
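A quick arithmetic check: the difference between the two key counts matches the roughly 8000 missing rows mentioned above.

```python
# Key counts reported above: 48871 in the original file,
# 40271 in the cleaned file; the cleaned file misses 8600 rows.
nOrig = 48871
nClean = 40271
assert nOrig - nClean == 8600
```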
So far so good.
Report the cases where a key of the cleaned file cannot be found in the original file, and vice versa.
def checkIndex(sourceIndex, sourceLabel, targetIndex, targetLabel, path):
    """Report the keys in targetIndex that are not in sourceIndex.

    Write the offending keys to path.
    """
    n = 0

    with open(path, "w") as fh:
        fh.write("key\tvalue\n")
        for key in targetIndex:
            if key not in sourceIndex:
                value = ",".join(targetIndex[key])
                fh.write(f"{key}\t{value}\n")
                n += 1

    if n == 0:
        print(f"OK: all {targetLabel} keys are also {sourceLabel} keys")
    else:
        print(f"WARNING: {n} {targetLabel} keys are not a {sourceLabel} key")
        print(f"See {path}\n")
checkIndex(origRowIndex, "original", cleanRowIndex, "cleaned", ORIG_MISSING)
checkIndex(cleanRowIndex, "cleaned", origRowIndex, "original", CLEAN_MISSING)
OK: all cleaned keys are also original keys
WARNING: 8600 original keys are not a cleaned key
See /Users/dirk/github/among/fusus/fusust-text-laboratory/AfifiDeletedRows.tsv
Good.
We now know that every key in the cleaned file occurs exactly once in the original file.
We can therefore reliably add the punctuation from the original row to the corresponding cleaned row and put the result in the tweaked file.
We also produce a file that lists the non-trivial punctuation, marked added or ignored.
twf = open(TWEAK_FILE, "w")
twf.write("page,stripe,column,line,left,top,right,bottom,confidence,letters,punc\n")

taf = open(PUNC_ADDED, "w")
taf.write("key\tstatus\tpunc\tletters\torigletters\n")

nTotal = 0
nNonEmpty = 0
nNonWhite = 0
nIgnored = 0

for (key, value) in cleanRowIndex.items():
    origValue = origRowIndex[key]
    (confidence, letters) = value[0:2]
    (origConfidence, origLetters, punc) = origValue[0:3]
    if "(" in punc or ")" in punc:
        ignored = True
        puncRep = punc.replace("(", "").replace(")", "")
    else:
        ignored = False
        puncRep = punc
    twf.write(f"{key},{confidence},{letters},{puncRep}\n")
    nTotal += 1
    if punc:
        nNonEmpty += 1
        if punc != " ":
            nNonWhite += 1
            if ignored:
                nIgnored += 1
                ignoredRep = "ignored"
            else:
                ignoredRep = "added"
            taf.write(f"{key}\t{ignoredRep}\t{punc}\t{letters}\t{origLetters}\n")

twf.close()
taf.close()
print(f"""Punc fields
{nTotal} rows with:
{nNonEmpty} times non-empty punctuation of which:
{nNonWhite} times not a space of which:
{nIgnored} times ignored and replaced by a space
""")
Punc fields
40271 rows with:
37872 times non-empty punctuation of which:
924 times not a space of which:
436 times ignored and replaced by a space
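As a sanity check, the reported counts nest properly: every ignored punctuation is non-white, every non-white punctuation is non-empty, and all are counted in the total.

```python
# Counts reported above, from outermost to innermost category.
nTotal = 40271
nNonEmpty = 37872
nNonWhite = 924
nIgnored = 436
assert nTotal >= nNonEmpty >= nNonWhite >= nIgnored
```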