Transkribus is a desktop application for transcribing handwritten texts and for training machine learning models that can then be used to transcribe such texts automatically.
As well as the desktop app, there's also a simple Python client for the Transkribus web service API [docs and wiki].
The wiki docs are primarily focused on calling the API from the command line.
Whilst we can use CLI calls from %bash
cells in a notebook, and in so doing document (or help automate) a workflow, I'm more interested in using the package in a Python scripting context.
So what can we do with it?
At the moment, the package is not pip installable, so we need to download the repo and then import the required module from a file:
%%capture
!wget https://github.com/Transkribus/TranskribusPyClient/archive/master.zip
!unzip master.zip
%cd TranskribusPyClient-master/src
from TranskribusPyClient.client import TranskribusClient
%cd ../..
To get started with the Python Transkribus API client, we need to create an instance of it:
t = TranskribusClient()
The service that client calls is an authenticated one, so we need to supply credentials for it:
from getpass import getpass
user = input('User: ')
pwd = getpass('Password: ')
User: tony.hirst@gmail.com Password: ········
Logging in updates the state of the client:
t.auth_login(user, pwd)
/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py:857: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
True
#Review the methods available from the client
#dir(t)
Many of the client calls require a collection ID, but I can't see how to request available collection IDs directly.
We can, however, find collection ID values from any recent jobs we've run, such as jobs run via the desktop application:
#We can find recent collection IDs from jobs...
jobs = t.getJobs()
colIds = list({j[k] for j in jobs for k in j if k=='colId'})
colIds_str = ', '.join([str(c) for c in colIds])
print(f'Recent collectionIDs: {colIds_str}')
Recent collectionIDs: 54474
We can then get a list of documents associated with a particular collection:
#How do we get the collectionId?
docs = t.listDocsByCollectionId(colIds[0])
md=''
for doc in docs:
    md = f"{md}\n### docId: {doc['docId']}\n{doc['title']}, {doc['nrOfPages']} pages\n"
print(md)
### docId: 268862
Dutch Handwriting 0.1, 1 pages

### docId: 268863
English Handwriting 0.1, 1 pages

### docId: 268864
Wiener Diarium 4.0, 1 pages

### docId: 268865
German Handwriting 0.1, 1 pages

### docId: 268888
luddite_png_test, 1 pages

### docId: 268910
HO-40-1_15, 36 pages
We can download the XML metadata associated with a document's transcripts either as a parsed lxml.etree document (bParse=True, the default) or as a text string.
col_id = colIds[0]
#doc_id = [d['docId'] for d in docs if d['title']=='English Handwriting 0.1'][0]
doc_id = 268888
#Get XML for a doc
_xml_str = t.getDocByIdAsXml(col_id, doc_id, bParse=False)
If we grab the text string, we can use the convenient xmltodict Python package to convert the XML text to a dict:
#!pip3 install xmltodict
import xmltodict
xml_dict = xmltodict.parse(_xml_str)
xml_dict.keys()
odict_keys(['trpDoc'])
The document comes in three parts: md, pageList, and collection:
xml_dict['trpDoc'].keys()
odict_keys(['md', 'pageList', 'collection'])
xml_dict['trpDoc']['md']
OrderedDict([('nrOfRegions', '10'), ('nrOfTranscribedRegions', '5'), ('nrOfWordsInRegions', '69'), ('nrOfLines', '31'), ('nrOfTranscribedLines', '23'), ('nrOfWordsInLines', '87'), ('nrOfWords', '0'), ('nrOfTranscribedWords', '0'), ('nrOfNew', '0'), ('nrOfInProgress', '1'), ('nrOfDone', '0'), ('nrOfFinal', '0'), ('nrOfGT', '0'), ('docId', '268888'), ('title', 'luddite_png_test'), ('uploadTimestamp', '1574708953013'), ('uploader', 'tony.hirst@gmail.com'), ('uploaderId', '51319'), ('nrOfPages', '1'), ('pageId', '11245035'), ('url', 'https://files.transkribus.eu/Get?id=BKBQUQKQPGATSRFUKPTFMAEM&fileType=view'), ('thumbUrl', 'https://files.transkribus.eu/Get?id=BKBQUQKQPGATSRFUKPTFMAEM&fileType=thumb'), ('status', '0'), ('fimgStoreColl', 'TrpDoc_DEA_268888'), ('origDocId', '0'), ('collectionList', OrderedDict([('colList', OrderedDict([('colId', '54474'), ('colName', 'tony.hirst@gmail.com Collection'), ('description', 'tony.hirst@gmail.com'), ('crowdsourcing', 'false'), ('elearning', 'false'), ('nrOfDocuments', '0')]))]))])
for k in xml_dict['trpDoc']['collection'].keys():
print(k,xml_dict['trpDoc']['collection'][k])
colId 54474 colName tony.hirst@gmail.com Collection description tony.hirst@gmail.com crowdsourcing false elearning false nrOfDocuments 0
The pageList structure contains transcript information about each page:
xml_dict['trpDoc']['pageList']['pages']['tsList']['transcripts'][0]
OrderedDict([('tsId', '19850473'), ('parentTsId', '19850334'), ('key', 'JZWYSDUSQHFJLVMGHZXSKBLY'), ('pageId', '11245035'), ('docId', '268888'), ('pageNr', '1'), ('url', 'https://files.transkribus.eu/Get?id=JZWYSDUSQHFJLVMGHZXSKBLY'), ('status', 'IN_PROGRESS'), ('userName', 'tony.hirst@gmail.com'), ('userId', '51319'), ('timestamp', '1574711666942'), ('md5Sum', None), ('nrOfRegions', '10'), ('nrOfTranscribedRegions', '5'), ('nrOfWordsInRegions', '69'), ('nrOfLines', '31'), ('nrOfTranscribedLines', '23'), ('nrOfWordsInLines', '87'), ('nrOfWords', '0'), ('nrOfTranscribedWords', '0')])
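Each transcript record includes a url field pointing at the PAGE XML for that transcript version. As a quick sketch (transcript_urls is a hypothetical helper, not part of the client), we can walk the parsed structure and collect those URLs, allowing for xmltodict's habit of returning a bare dict rather than a list when an element has only a single child:

```python
def transcript_urls(xml_dict):
    """Pull the transcript download URLs out of a parsed trpDoc structure.

    xmltodict returns a single dict rather than a list when an element
    has only one child, so normalise the pages and transcripts levels
    to lists before iterating."""
    pages = xml_dict['trpDoc']['pageList']['pages']
    if not isinstance(pages, list):
        pages = [pages]
    urls = []
    for page in pages:
        transcripts = page['tsList']['transcripts']
        if not isinstance(transcripts, list):
            transcripts = [transcripts]
        urls.extend(t['url'] for t in transcripts)
    return urls
```

For the single page document above, this should return a one-element list containing the transcript URL.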
We can download complete Transkribus documents given a collection and document ID.
The complete document contains the coordinates for segmented text regions as well as the transcript for each region. Image downloads for each page are enabled by default (bNoImage=False).
If the specified download folder does not exist, it is created. If it does exist and bForce=False (the default), an error is raised; if bForce=True, the directory is deleted and an empty one of the same name is recreated.
download_dir = 'testDOwn'
t.download_document(col_id, doc_id, download_dir)
(1574711666942, ['HO-40-2_13_14'])
!ls $download_dir
HO-40-2_13_14.png HO-40-2_13_14.pxml max.ts trp.json
The trp.json file looks like it's the Transkribus data structure we could download as the XML file.
(It would be nice if getDocByIdAsXml() were refactored as getDocById() with a switch allowing for xml|json and the bParse flag, when set, returning the etree or a Python dict correspondingly.)
!cat $download_dir/trp.json
{ "collection": { "colId": 54474, "colName": "tony.hirst@gmail.com Collection", "crowdsourcing": false, "description": "tony.hirst@gmail.com", "elearning": false, "nrOfDocuments": 0 }, "edDeclList": [], "md": { "collectionList": { "colList": [ { "colId": 54474, "colName": "tony.hirst@gmail.com Collection", "crowdsourcing": false, "description": "tony.hirst@gmail.com", "elearning": false, "nrOfDocuments": 0 } ] }, "docId": 268888, "fimgStoreColl": "TrpDoc_DEA_268888", "nrOfDone": 0, "nrOfFinal": 0, "nrOfGT": 0, "nrOfInProgress": 1, "nrOfLines": 31, "nrOfNew": 0, "nrOfPages": 1, "nrOfRegions": 10, "nrOfTranscribedLines": 23, "nrOfTranscribedRegions": 5, "nrOfTranscribedWords": 0, "nrOfWords": 0, "nrOfWordsInLines": 87, "nrOfWordsInRegions": 69, "origDocId": 0, "pageId": 11245035, "status": 0, "thumbUrl": "https://files.transkribus.eu/Get?id=BKBQUQKQPGATSRFUKPTFMAEM&fileType=thumb", "title": "luddite_png_test", "uploadTimestamp": 1574708953013, "uploader": "tony.hirst@gmail.com", "uploaderId": 51319, "url": "https://files.transkribus.eu/Get?id=BKBQUQKQPGATSRFUKPTFMAEM&fileType=view" }, "pageList": { "pages": [ { "created": "2019-11-25T20:09:13.309+01:00", "docId": 268888, "height": 1753, "imageId": 7700237, "imageVersions": { "imageVersions": [] }, "imgFileName": "HO-40-2_13_14.png", "indexed": false, "key": "BKBQUQKQPGATSRFUKPTFMAEM", "pageId": 11245035, "pageNr": 1, "tagsStored": "2019-11-25T20:54:27.341+01:00", "thumbUrl": "https://files.transkribus.eu/Get?id=BKBQUQKQPGATSRFUKPTFMAEM&fileType=thumb", "tsList": { "transcripts": [ { "docId": 268888, "key": "JZWYSDUSQHFJLVMGHZXSKBLY", "md5Sum": "", "nrOfLines": 31, "nrOfRegions": 10, "nrOfTranscribedLines": 23, "nrOfTranscribedRegions": 5, "nrOfTranscribedWords": 0, "nrOfWords": 0, "nrOfWordsInLines": 87, "nrOfWordsInRegions": 69, "pageId": 11245035, "pageNr": 1, "parentTsId": 19850334, "status": "IN_PROGRESS", "timestamp": 1574711666942, "tsId": 19850473, "url": 
"https://files.transkribus.eu/Get?id=JZWYSDUSQHFJLVMGHZXSKBLY", "userId": 51319, "userName": "tony.hirst@gmail.com" } ] }, "url": "https://files.transkribus.eu/Get?id=BKBQUQKQPGATSRFUKPTFMAEM&fileType=view", "width": 2480 } ] } }
The png file is an image file for the page:
from IPython.display import Image
Image('testDOwn/HO-40-2_13_14.png')
The pxml file contains the line segmentation and transcript information.
The aim here is to see if we can take the XML from a document that has already been segmented and transcribed and process it in some way. For example, crop out each line from the image file and place it in a markdown document, along with the transcription of that line.
We could represent the pxml file as an ordered Python dict using xmltodict, or we could parse it as an XML document. Let's do the latter for now:
import xml.etree.ElementTree as ET

def transkribus_tree(fn):
    with open(fn) as f:
        _xml = f.read()
    #Clean the XML of namespace cruft
    _xml = _xml.replace('xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15 http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15/pagecontent.xsd"','')
    #Parse the XML
    tree = ET.fromstring(_xml)
    return tree
tree = transkribus_tree('testDown/HO-40-2_13_14.pxml')
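The string replacement above is brittle (it breaks if the attribute ordering or schema version ever changes). A more general sketch, assuming all we want is namespace-free tag names, strips the namespace from every element after parsing:

```python
import xml.etree.ElementTree as ET

def strip_ns_tree(xml_str):
    """Parse an XML string and strip namespace prefixes from all tags,
    so elements can be found with plain paths like 'Page/TextRegion'."""
    tree = ET.fromstring(xml_str)
    for el in tree.iter():
        #Namespaced tags look like '{http://...}TagName'
        if '}' in el.tag:
            el.tag = el.tag.split('}', 1)[1]
    return tree
```

This works whatever namespace declarations the file happens to use.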
For each text region, which may contain one or more text lines, each with its own transcript, a complete transcript is also provided:
!head -n 200 testDown/HO-40-2_13_14.pxml | tail -n 30
<Unicode>411</Unicode> </TextEquiv> </TextLine> <TextEquiv> <Unicode>The National Archives' reference HO 40/2 Halifax 16th October 1812 15 Sir I beg leave to inform you that I have made every enquiry respecting the robberies, lately committed in this neighbourhood; and I am and I am sorry I have not been able to gain // any further information than what I have sent formerly. Tuesday, the 13th Inst. I have the honor to be Sir Your Most Obedient Servant Stirling Militia 411</Unicode> </TextEquiv> </TextRegion> <TextRegion orientation="0.0" id="r3" custom="readingOrder {index:2;}"> <Coords points="2037,383 2037,539 2049,539 2049,383"/> <TextEquiv>
We can parse this out as follows:
for t in tree.findall('Page/TextRegion/TextEquiv/Unicode'):
    print('...\n', t.text)
... 588 B ... The National Archives' reference HO 40/2 Halifax 16th October 1812 15 Sir I beg leave to inform you that I have made every enquiry respecting the robberies, lately committed in this neighbourhood; and I am and I am sorry I have not been able to gain // any further information than what I have sent formerly. Tuesday, the 13th Inst. I have the honor to be Sir Your Most Obedient Servant Stirling Militia 411 ... None ... 3 ... None ... 1 2 3 4 ... None ... COPYRIGHT PHOTOGRAPH - NOT TO BE
We can also pull out the co-ordinates defining lines identified within each text region.
Here's what the XML looks like:
!head -n 45 testDown/HO-40-2_13_14.pxml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?> <PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15 http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15/pagecontent.xsd"> <Metadata> <Creator>prov=University of Rostock/Institute of Mathematics/CITlab/Tobias Gruening/tobias.gruening@uni-rostock.de:name=/net_tf/LA73_249_0mod360.pb:de.uros.citlab.segmentation.CITlab_LA_ML:v=2.4.2 prov=University of Rostock/Institute of Mathematics/CITlab/Gundram Leifert/gundram.leifert@uni-rostock.de:name=English Writing M1(htr_id=133)::::v=2.4.3 prov=University of Rostock/Institute of Mathematics/CITlab/Tobias Gruening/tobias.gruening@uni-rostock.de:name=de.uros.citlab.module.baseline2polygon.B2PSeamMultiOriented:v=2.4.3 prov=University of Rostock/Institute of Mathematics/CITlab/Tobias Gruening/tobias.gruening@uni-rostock.de:name=/net_tf/LA73_249_0mod360.pb:de.uros.citlab.segmentation.CITlab_LA_ML:v=2.4.2 Transkribus</Creator> <Created>2019-11-25T19:08:42.512Z</Created> <LastChange>2019-11-25T20:54:26.883+01:00</LastChange> </Metadata> <Page imageFilename="HO-40-2_13_14.png" imageWidth="2480" imageHeight="1753"> <ReadingOrder> <OrderedGroup id="ro_1574711666913" caption="Regions reading order"> <RegionRefIndexed index="0" regionRef="r1"/> <RegionRefIndexed index="1" regionRef="r2"/> <RegionRefIndexed index="2" regionRef="r3"/> <RegionRefIndexed index="3" regionRef="r4"/> <RegionRefIndexed index="4" regionRef="r5"/> <RegionRefIndexed index="5" regionRef="r6"/> <RegionRefIndexed index="6" regionRef="r7"/> <RegionRefIndexed index="7" regionRef="r8"/> <RegionRefIndexed index="8" regionRef="r9"/> <RegionRefIndexed index="9" regionRef="r10"/> </OrderedGroup> </ReadingOrder> <TextRegion orientation="0.0" id="r1" custom="readingOrder {index:0;}"> <Coords points="130,1210 130,1364 202,1364 202,1210"/> <TextLine 
id="r1l1" custom="readingOrder {index:0;}"> <Coords points="130,1364 166,1362 202,1360 202,1210 166,1212 130,1214"/> <Baseline points="130,1314 166,1312 202,1310"/> <TextEquiv> <Unicode>588 B</Unicode> </TextEquiv> </TextLine> <TextEquiv> <Unicode>588 B</Unicode> </TextEquiv> </TextRegion> <TextRegion orientation="0.0" id="r2" custom="readingOrder {index:1;}"> <Coords points="1015,12 1015,1217 1991,1217 1991,12"/> <TextLine id="r2l1" custom="readingOrder {index:0;}"> <Coords points="1015,49 1040,49 1065,49 1090,49 1115,50 1140,50 1166,50 1191,50 1216,50 1241,50 1266,49 1291,49 1317,49 1342,49 1367,49 1392,50 1417,50 1442,50 1468,50 1468,13 1442,13 1417,13 1392,13 1367,12 1342,12 1317,12 1291,12 1266,12 1241,13 1216,13 1191,13 1166,13 1140,13 1115,13 1090,12 1065,12 1040,12 1015,12"/> <Baseline points="1015,37 1040,37 1065,37 1090,37 1115,38 1140,38 1166,38 1191,38 1216,38 1241,38 1266,37 1291,37 1317,37 1342,37 1367,37 1392,38 1417,38 1442,38 1468,38"/> <TextEquiv>
So let's get some co-ordinates...
for line in tree.findall('Page/TextRegion')[1].findall('TextLine')[3:6]:
    te = line.find('TextEquiv/Unicode')
    if te is not None:
        print(te.text, ' ::: ', line.find('Coords').attrib['points'], '\n')
Sir ::: 1225,463 1257,460 1289,461 1289,424 1257,423 1225,426 I beg leave to inform you that I have made ::: 1331,527 1363,526 1396,525 1429,524 1461,522 1494,520 1527,518 1559,516 1592,513 1625,511 1658,509 1690,507 1723,505 1756,503 1788,502 1821,500 1854,499 1886,498 1919,498 1952,498 1985,499 1985,462 1952,461 1919,461 1886,461 1854,462 1821,463 1788,465 1756,466 1723,468 1690,470 1658,472 1625,474 1592,476 1559,479 1527,481 1494,483 1461,485 1429,487 1396,488 1363,489 1331,490 every enquiry respecting the robberies, lately ::: 1230,574 1265,576 1300,578 1335,578 1371,579 1406,579 1441,578 1477,577 1512,576 1547,575 1583,574 1618,572 1653,571 1688,569 1724,568 1759,567 1794,566 1830,565 1865,565 1900,565 1936,566 1936,529 1900,528 1865,528 1830,528 1794,529 1759,530 1724,531 1688,532 1653,534 1618,535 1583,537 1547,538 1512,539 1477,540 1441,541 1406,542 1371,542 1335,541 1300,541 1265,539 1230,537
Using these co-ordinates, can we crop a region of the image corresponding to the text?
How about trying with OpenCV?
import numpy as np
import cv2
img = cv2.imread("testDOwn/HO-40-2_13_14.png")
Preview the image:
from matplotlib import pyplot as plt
plt.imshow(img)
plt.title('Sample Page')
plt.show()
Extract some co-ordinates:
_pts = tree.findall('Page/TextRegion')[1].findall('TextLine')[4].find('Coords').attrib['points']
pts = np.array([[int(_p.split(',')[0]), int(_p.split(',')[1])] for _p in _pts.split()])
pts[:3]
array([[1331, 527], [1363, 526], [1396, 525]])
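Before doing anything fancier, note that a simple axis-aligned crop only needs the min and max of the points array; a quick sketch using plain numpy slicing (no masking, so pixels from neighbouring lines inside the bounding box will bleed into the crop; boundingBoxCrop is just an illustrative name):

```python
import numpy as np

def boundingBoxCrop(img, pts):
    """Crop the axis-aligned bounding box of a set of polygon points.

    img is a numpy image array (rows index y, columns index x);
    pts is an (N, 2) array of (x, y) points."""
    x0, y0 = pts.min(axis=0)
    x1, y1 = pts.max(axis=0)
    return img[y0:y1, x0:x1]
```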
And then via a handy little script I found on Stack Overflow we can crop out the specified region:
import matplotlib as mpl
def getCroppedArea(img, pts, dpi=300):
    """Get a cropped area from an image.
    Via: https://stackoverflow.com/a/48301735/454773"""
    _dpi = mpl.rcParams['figure.dpi']
    mpl.rcParams['figure.dpi'] = dpi
    ## (1) Crop the bounding rect
    rect = cv2.boundingRect(pts)
    x, y, w, h = rect
    cropped = img[y:y+h, x:x+w].copy()
    ## (2) Make a mask from the polygon
    pts = pts - pts.min(axis=0)
    mask = np.zeros(cropped.shape[:2], np.uint8)
    cv2.drawContours(mask, [pts], -1, (255, 255, 255), -1, cv2.LINE_AA)
    ## (3) Apply the mask
    dst = cv2.bitwise_and(cropped, cropped, mask=mask)
    ## (4) Add the white background
    bg = np.ones_like(cropped, np.uint8) * 255
    cv2.bitwise_not(bg, bg, mask=mask)
    cropped_area = bg + dst
    #Restore the original figure dpi
    mpl.rcParams['figure.dpi'] = _dpi
    return cropped_area
Let's see if it works...
sentence = getCroppedArea(img, pts, 400)
plt.figure(figsize = (15,5))
plt.axis('off')
plt.imshow(sentence)
#Save the image
plt.savefig("test.png", dpi=400)
We can also embed the saved image, of course...
Image("test.png")
This sets up the possibility of creating a markdown file, for example, that embeds single line images as well as transcripts. Using Jupytext, such a document can be edited in a notebook editor.
#!pip3 install --upgrade tqdm
import os.path
import hashlib
import pathlib
from tqdm.notebook import tqdm
def markdownFromRegion(region, imgfile, fn='testout.md', outdir='test'):
    """Generate markdown page for Transkribus text region."""
    #Make sure the path to the outdir exists...
    pathlib.Path(outdir).mkdir(parents=True, exist_ok=True)
    md = ''
    img = cv2.imread(imgfile)
    #Provide a progress bar using tqdm
    for line in tqdm(region.findall('TextLine')):
        _pts = line.find('Coords').attrib['points']
        pts = np.array([[int(_p.split(',')[0]), int(_p.split(',')[1])] for _p in _pts.split()])
        sentence = getCroppedArea(img, pts, 400)
        #Save the image
        _img_uid = hashlib.md5(str(pts).encode()).hexdigest()
        ifn = f"img_{_img_uid}.png"
        cv2.imwrite(os.path.join(outdir, ifn), sentence)
        _txt = line.find('TextEquiv/Unicode')
        _txt = _txt.text if _txt is not None else ''
        #Embed the line image followed by its transcript
        md = f'{md}\n\n![]({ifn})\n\n{_txt}\n\n'
    with open(os.path.join(outdir, fn), 'w') as f:
        f.write(md)
Running it seems to be pretty quick...
tree = transkribus_tree('testDown/HO-40-2_13_14.pxml')
markdownFromRegion(tree.findall('Page/TextRegion')[1], "testDOwn/HO-40-2_13_14.png")
Here's an example of what the md raw, and rendered, looks like:
Image("transkribus_testout_md.png")
Let's see if we can create a collection containing a couple of one or two page PDFs, segment the text lines and retrieve the XML and JPG, then segment the JPG into a markdown document.
# Create a collection
#Get one or two pages out of a PDF as a PDF
# Submit the doc
# Segment the Lines
# Get the XML and JPG Back
# Crop each line, saving image fragments
# Insert image link to each fragment into md along with transcription text