At the end of the Fusus workflow the OCRed Afifi edition is produced as a .tsv file.
Cornelis has imported this file into Pandas, cleaned it, and saved it as a .csv file.
In that process, the punctuation after words has gone missing.
We reinsert it, based on the .tsv file.
However, the number of rows in the two files is not equal: the cleaned file has ca. 8000 rows fewer.
So we have to look carefully at which punctuation we are going to reinsert.
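To make the matching concrete, here is a tiny sketch (with made-up field values) of how the same word box can be identified in both files. The field names come from the key description below; the concrete values are hypothetical.

```python
# Hypothetical rows: the first 8 fields (page, stripe, column, line,
# left, top, right, bottom) identify a word box; the original row
# carries a trailing punctuation field that the cleaned row has lost.
origRow = ["47", "1", "1", "5", "100", "200", "300", "220", "0.99", "word", "."]
cleanRow = ["47", "1", "1", "5", "100", "200", "300", "220", "0.99", "word"]

origKey = ",".join(origRow[0:8])
cleanKey = ",".join(cleanRow[0:8])

assert origKey == cleanKey  # same word box in both files
punc = origRow[10]          # the punctuation we want to carry over
```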
We should not insert punctuation of the form ( and ).

We join the fields that identify a row with a , and call that the key of the row.
Per key we note added if we have added the punctuation, or ignored if we have ignored it.

Empty values show up as 0 in the cleaned file.
We restore those zeroes to empty strings in the resulting tweaked file.

import os
import collections
BASE_DIR = os.path.expanduser("~/github/among/fusus")
CLEAN_FILE = f"{BASE_DIR}/fusust-text-laboratory/AfifiCleaned.csv"
ORIG_FILE = f"{BASE_DIR}/ur/Afifi/allpages.tsv"
TWEAK_FILE = f"{BASE_DIR}/fusust-text-laboratory/AfifiTweaked.csv"
CLEAN_MISSING = f"{BASE_DIR}/fusust-text-laboratory/AfifiDeletedRows.tsv"
ORIG_MISSING = f"{BASE_DIR}/fusust-text-laboratory/AfifiNotFoundRows.tsv"
PUNC_ADDED = f"{BASE_DIR}/fusust-text-laboratory/AfifiAddedPunc.tsv"
We make an index whose keys are the combination of the page, stripe, column, line, left, top, right, and bottom fields, and whose values are tuples of the remaining fields.
We detect when multiple rows have the same key.
We do this for both the original and the cleaned file.
Note that the two files have different field separators, so we pass the separator as a parameter.
We pass correct=2 to replace zeroes in column 2 by empty strings; we need that for the cleaned file.
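A minimal illustration (with a hypothetical sample row) of what correct=2 does: a literal 0 in column 2 stands for an empty value and is blanked out, so that keys from the two files can match again.

```python
# Hypothetical cleaned-file row: column 2 contains "0" where the
# original value was empty; correct=2 restores the empty string.
fields = "47,1,0,5,100,200,300,220,0.99,word".split(",")
correct = 2
if fields[correct] == "0":
    fields[correct] = ""

assert fields[2] == ""
```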
def makeIndex(path, label, sep, correct=None):
    rowIndex = {}
    duplicateKeys = {}

    with open(path) as fh:
        next(fh)
        for line in fh:
            fields = line.rstrip("\n").split(sep)
            if correct is not None:
                if fields[correct] == "0":
                    fields[correct] = ""
            key = ",".join(fields[0:8])
            value = tuple(fields[8:])
            if key in rowIndex:
                if key in duplicateKeys:
                    duplicateKeys[key].append(value)
                else:
                    duplicateKeys[key] = [rowIndex[key], value]
            rowIndex[key] = value

    print(f"INFO: {label}: There are {len(rowIndex)} keys")
    if duplicateKeys:
        print(f"WARNING: {label}: There are {len(duplicateKeys)} keys with multiple rows")
    else:
        print(f"OK: {label}: No keys with multiple rows")
    return rowIndex
origRowIndex = makeIndex(ORIG_FILE, "original file", "\t")
cleanRowIndex = makeIndex(CLEAN_FILE, "cleaned file", ",", correct=2)
INFO: original file: There are 48871 keys
OK: original file: No keys with multiple rows
INFO: cleaned file: There are 40271 keys
OK: cleaned file: No keys with multiple rows
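A quick arithmetic check: the difference between the two key counts matches the roughly 8000 missing rows mentioned above.

```python
# Key counts reported above: 48871 in the original file,
# 40271 in the cleaned file; the cleaned file misses 8600 rows.
nOrig = 48871
nClean = 40271
assert nOrig - nClean == 8600
```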
So far so good.
Report the cases where a key of the cleaned file cannot be found in the original file, and vice versa.
def checkIndex(sourceIndex, sourceLabel, targetIndex, targetLabel, path):
    """Report the keys in targetIndex that are not in sourceIndex.

    Write the offending keys to path.
    """
    n = 0

    with open(path, "w") as fh:
        fh.write("key\tvalue\n")
        for key in targetIndex:
            if key not in sourceIndex:
                value = ",".join(targetIndex[key])
                fh.write(f"{key}\t{value}\n")
                n += 1

    if n == 0:
        print(f"OK: all {targetLabel} keys are also {sourceLabel} keys")
    else:
        print(f"WARNING: {n} {targetLabel} keys are not a {sourceLabel} key")
        print(f"See {path}\n")
checkIndex(origRowIndex, "original", cleanRowIndex, "cleaned", ORIG_MISSING)
checkIndex(cleanRowIndex, "cleaned", origRowIndex, "original", CLEAN_MISSING)
OK: all cleaned keys are also original keys
WARNING: 8600 original keys are not a cleaned key
See /Users/dirk/github/among/fusus/fusust-text-laboratory/AfifiDeletedRows.tsv
Good.
We now know that every key in the cleaned file occurs exactly once in the original file.
We can therefore reliably add the punctuation from the original row to the corresponding cleaned row and put the result in the tweaked file.
We also produce a file that lists the non-trivial punctuation, marked added or ignored.
twf = open(TWEAK_FILE, "w")
twf.write("page,stripe,column,line,left,top,right,bottom,confidence,letters,punc\n")

taf = open(PUNC_ADDED, "w")
taf.write("key\tstatus\tpunc\tletters\torigletters\n")

nTotal = 0
nNonEmpty = 0
nNonWhite = 0
nIgnored = 0

for (key, value) in cleanRowIndex.items():
    origValue = origRowIndex[key]
    (confidence, letters) = value[0:2]
    (origConfidence, origLetters, punc) = origValue[0:3]
    if "(" in punc or ")" in punc:
        ignored = True
        puncRep = punc.replace("(", "").replace(")", "")
    else:
        ignored = False
        puncRep = punc
    twf.write(f"{key},{confidence},{letters},{puncRep}\n")
    nTotal += 1
    if punc:
        nNonEmpty += 1
        if punc != " ":
            nNonWhite += 1
            if ignored:
                nIgnored += 1
                ignoredRep = "ignored"
            else:
                ignoredRep = "added"
            taf.write(f"{key}\t{ignoredRep}\t{punc}\t{letters}\t{origLetters}\n")

twf.close()
taf.close()
print(f"""Punc fields
{nTotal} rows with:
{nNonEmpty} times non-empty punctuation of which:
{nNonWhite} times not a space of which:
{nIgnored} times ignored and replaced by a space
""")
Punc fields
40271 rows with:
37872 times non-empty punctuation of which:
924 times not a space of which:
436 times ignored and replaced by a space
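As a sanity check, the reported counts nest properly: every ignored punctuation is non-white, every non-white punctuation is non-empty, and all are counted in the total.

```python
# Counts reported above, from outermost to innermost category.
nTotal = 40271
nNonEmpty = 37872
nNonWhite = 924
nIgnored = 436
assert nTotal >= nNonEmpty >= nNonWhite >= nIgnored
```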