Transkribus is a desktop application for transcribing handwritten texts and for training machine learning models that can then be used to transcribe such texts automatically.
As well as the desktop app, there's also a simple Python client for the Transkribus web service API [docs and wiki].
The wiki docs are primarily focused on calling the API from the command line.
Whilst we can use CLI calls from %bash
cells in a notebook, and in so doing document (or help automate) a workflow, I'm more interested in using the package in a Python scripting context.
So what can we do with it?
At the moment, the package is not pip installable, so we need to download the repo and then import the required module from a file:
%%capture
!wget https://github.com/Transkribus/TranskribusPyClient/archive/master.zip
!unzip master.zip
%cd TranskribusPyClient-master/src
from TranskribusPyClient.client import TranskribusClient
%cd ../..
To get started with the Python Transkribus API client, we need to create an instance of it:
t = TranskribusClient()
The service that client calls is an authenticated one, so we need to supply credentials for it:
from getpass import getpass
user = input('User: ')
pwd = getpass('Password: ')
User: tony.hirst@gmail.com Password: ········
Logging in updates the state of the client:
t.auth_login(user, pwd)
/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py:857: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
True
#Review the methods available from the client
#dir(t)
Many of the client calls require a collection ID, but I can't see how to request available collection IDs directly.
We can, however, find collection ID values from any recent jobs we've run, such as jobs run via the desktop application:
#We can find recent collection IDs from jobs...
jobs = t.getJobs()
colIds = list({j[k] for j in jobs for k in j if k=='colId'})
colIds_str = ', '.join([str(c) for c in colIds])
print(f'Recent collectionIDs: {colIds_str}')
Recent collectionIDs: 54474
We can then get a list of documents associated with a particular collection:
#How do we get the collectionId?
docs = t.listDocsByCollectionId(colIds[0])
md=''
for doc in docs:
    md = f"{md}\n### docId: {doc['docId']}\n{doc['title']}, {doc['nrOfPages']} pages\n"
print(md)
### docId: 268862
Dutch Handwriting 0.1, 1 pages

### docId: 268863
English Handwriting 0.1, 1 pages

### docId: 268864
Wiener Diarium 4.0, 1 pages

### docId: 268865
German Handwriting 0.1, 1 pages

### docId: 268888
luddite_png_test, 1 pages

### docId: 268910
HO-40-1_15, 36 pages
We can download the XML metadata associated with a document's transcripts either as a parsed lxml.etree document (bParse=True, the default) or as a text string.
col_id = colIds[0]
#doc_id = [d['docId'] for d in docs if d['title']=='English Handwriting 0.1'][0]
doc_id = 268888
#Get XML for a doc
_xml_str = t.getDocByIdAsXml(col_id, doc_id, bParse=False)
If we grab the text string, we can use the convenient xmltodict Python package to convert the XML text to a dict:
#!pip3 install xmltodict
import xmltodict
xml_dict = xmltodict.parse(_xml_str)
xml_dict.keys()
odict_keys(['trpDoc'])
The document comes in three parts: md, pageList, and collection:
xml_dict['trpDoc'].keys()
odict_keys(['md', 'pageList', 'collection'])
xml_dict['trpDoc']['md']
OrderedDict([('nrOfRegions', '10'), ('nrOfTranscribedRegions', '5'), ('nrOfWordsInRegions', '69'), ('nrOfLines', '31'), ('nrOfTranscribedLines', '23'), ('nrOfWordsInLines', '87'), ('nrOfWords', '0'), ('nrOfTranscribedWords', '0'), ('nrOfNew', '0'), ('nrOfInProgress', '1'), ('nrOfDone', '0'), ('nrOfFinal', '0'), ('nrOfGT', '0'), ('docId', '268888'), ('title', 'luddite_png_test'), ('uploadTimestamp', '1574708953013'), ('uploader', 'tony.hirst@gmail.com'), ('uploaderId', '51319'), ('nrOfPages', '1'), ('pageId', '11245035'), ('url', 'https://files.transkribus.eu/Get?id=BKBQUQKQPGATSRFUKPTFMAEM&fileType=view'), ('thumbUrl', 'https://files.transkribus.eu/Get?id=BKBQUQKQPGATSRFUKPTFMAEM&fileType=thumb'), ('status', '0'), ('fimgStoreColl', 'TrpDoc_DEA_268888'), ('origDocId', '0'), ('collectionList', OrderedDict([('colList', OrderedDict([('colId', '54474'), ('colName', 'tony.hirst@gmail.com Collection'), ('description', 'tony.hirst@gmail.com'), ('crowdsourcing', 'false'), ('elearning', 'false'), ('nrOfDocuments', '0')]))]))])
for k in xml_dict['trpDoc']['collection'].keys():
print(k,xml_dict['trpDoc']['collection'][k])
colId 54474 colName tony.hirst@gmail.com Collection description tony.hirst@gmail.com crowdsourcing false elearning false nrOfDocuments 0
The pageList structure contains transcript information about each page:
xml_dict['trpDoc']['pageList']['pages']['tsList']['transcripts'][0]
OrderedDict([('tsId', '19850473'), ('parentTsId', '19850334'), ('key', 'JZWYSDUSQHFJLVMGHZXSKBLY'), ('pageId', '11245035'), ('docId', '268888'), ('pageNr', '1'), ('url', 'https://files.transkribus.eu/Get?id=JZWYSDUSQHFJLVMGHZXSKBLY'), ('status', 'IN_PROGRESS'), ('userName', 'tony.hirst@gmail.com'), ('userId', '51319'), ('timestamp', '1574711666942'), ('md5Sum', None), ('nrOfRegions', '10'), ('nrOfTranscribedRegions', '5'), ('nrOfWordsInRegions', '69'), ('nrOfLines', '31'), ('nrOfTranscribedLines', '23'), ('nrOfWordsInLines', '87'), ('nrOfWords', '0'), ('nrOfTranscribedWords', '0')])
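Each transcript record includes a url field pointing at the PAGE XML for that transcript version. As a quick sketch (transcript_urls is a hypothetical helper, not part of the client), we can walk the parsed structure and collect those URLs, allowing for xmltodict's habit of returning a bare dict rather than a list when an element has only a single child:

```python
def transcript_urls(xml_dict):
    """Pull the transcript download URLs out of a parsed trpDoc structure.

    xmltodict returns a single dict rather than a list when an element
    has only one child, so normalise the pages and transcripts levels
    to lists before iterating."""
    pages = xml_dict['trpDoc']['pageList']['pages']
    if not isinstance(pages, list):
        pages = [pages]
    urls = []
    for page in pages:
        transcripts = page['tsList']['transcripts']
        if not isinstance(transcripts, list):
            transcripts = [transcripts]
        urls.extend(t['url'] for t in transcripts)
    return urls
```

For the single page document above, this should return a one-element list containing the transcript URL.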
We can download complete Transkribus documents given a collection and document ID.
The complete document contains the coordinates for segmented text regions as well as the transcript for each region. Image downloads for each page are enabled by default (bNoImage=False).
If the specified download folder does not exist, it is created. If it does exist and bForce=False (the default), an error is raised; if bForce=True, the directory is deleted and an empty one of the same name is recreated.
download_dir = 'testDOwn'
t.download_document(col_id, doc_id, download_dir)
(1574711666942, ['HO-40-2_13_14'])
!ls $download_dir
HO-40-2_13_14.png HO-40-2_13_14.pxml max.ts trp.json
The trp.json file looks like it's the Transkribus data structure we could download as the XML file.
(It would be nice if getDocByIdAsXml() were refactored as getDocById() with a switch allowing for xml|json and the bParse flag, when set, returning the etree or a Python dict correspondingly.)
!cat $download_dir/trp.json
{ "collection": { "colId": 54474, "colName": "tony.hirst@gmail.com Collection", "crowdsourcing": false, "description": "tony.hirst@gmail.com", "elearning": false, "nrOfDocuments": 0 }, "edDeclList": [], "md": { "collectionList": { "colList": [ { "colId": 54474, "colName": "tony.hirst@gmail.com Collection", "crowdsourcing": false, "description": "tony.hirst@gmail.com", "elearning": false, "nrOfDocuments": 0 } ] }, "docId": 268888, "fimgStoreColl": "TrpDoc_DEA_268888", "nrOfDone": 0, "nrOfFinal": 0, "nrOfGT": 0, "nrOfInProgress": 1, "nrOfLines": 31, "nrOfNew": 0, "nrOfPages": 1, "nrOfRegions": 10, "nrOfTranscribedLines": 23, "nrOfTranscribedRegions": 5, "nrOfTranscribedWords": 0, "nrOfWords": 0, "nrOfWordsInLines": 87, "nrOfWordsInRegions": 69, "origDocId": 0, "pageId": 11245035, "status": 0, "thumbUrl": "https://files.transkribus.eu/Get?id=BKBQUQKQPGATSRFUKPTFMAEM&fileType=thumb", "title": "luddite_png_test", "uploadTimestamp": 1574708953013, "uploader": "tony.hirst@gmail.com", "uploaderId": 51319, "url": "https://files.transkribus.eu/Get?id=BKBQUQKQPGATSRFUKPTFMAEM&fileType=view" }, "pageList": { "pages": [ { "created": "2019-11-25T20:09:13.309+01:00", "docId": 268888, "height": 1753, "imageId": 7700237, "imageVersions": { "imageVersions": [] }, "imgFileName": "HO-40-2_13_14.png", "indexed": false, "key": "BKBQUQKQPGATSRFUKPTFMAEM", "pageId": 11245035, "pageNr": 1, "tagsStored": "2019-11-25T20:54:27.341+01:00", "thumbUrl": "https://files.transkribus.eu/Get?id=BKBQUQKQPGATSRFUKPTFMAEM&fileType=thumb", "tsList": { "transcripts": [ { "docId": 268888, "key": "JZWYSDUSQHFJLVMGHZXSKBLY", "md5Sum": "", "nrOfLines": 31, "nrOfRegions": 10, "nrOfTranscribedLines": 23, "nrOfTranscribedRegions": 5, "nrOfTranscribedWords": 0, "nrOfWords": 0, "nrOfWordsInLines": 87, "nrOfWordsInRegions": 69, "pageId": 11245035, "pageNr": 1, "parentTsId": 19850334, "status": "IN_PROGRESS", "timestamp": 1574711666942, "tsId": 19850473, "url": 
"https://files.transkribus.eu/Get?id=JZWYSDUSQHFJLVMGHZXSKBLY", "userId": 51319, "userName": "tony.hirst@gmail.com" } ] }, "url": "https://files.transkribus.eu/Get?id=BKBQUQKQPGATSRFUKPTFMAEM&fileType=view", "width": 2480 } ] } }
The png file is an image file for the page:
from IPython.display import Image
Image('testDOwn/HO-40-2_13_14.png')
The pxml file contains the line segmentation and transcript information.
The aim here is to see if we can take the XML from a document that has already been segmented and transcribed and process it in some way. For example, crop out each line from the image file and place it in a markdown document, along with the transcription of that line.
We could represent the pxml file as an ordered Python dict using xmltodict, or we could parse it as an XML document. Let's do the latter for now:
import xml.etree.ElementTree as ET

def transkribus_tree(fn):
    with open(fn) as f:
        _xml = f.read()
    #Clean the XML of namespace cruft
    _xml = _xml.replace('xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15 http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15/pagecontent.xsd"','')
    #Parse the XML
    tree = ET.fromstring(_xml)
    return tree
tree = transkribus_tree('testDown/HO-40-2_13_14.pxml')
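The string replacement above is brittle (it breaks if the attribute ordering or schema version ever changes). A more general sketch, assuming all we want is namespace-free tag names, strips the namespace from every element after parsing:

```python
import xml.etree.ElementTree as ET

def strip_ns_tree(xml_str):
    """Parse an XML string and strip namespace prefixes from all tags,
    so elements can be found with plain paths like 'Page/TextRegion'."""
    tree = ET.fromstring(xml_str)
    for el in tree.iter():
        #Namespaced tags look like '{http://...}TagName'
        if '}' in el.tag:
            el.tag = el.tag.split('}', 1)[1]
    return tree
```

This works whatever namespace declarations the file happens to use.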
For each text region, which may contain one or more text lines, each with its own transcript, a complete transcript is also provided:
!head -n 200 testDown/HO-40-2_13_14.pxml | tail -n 30
<Unicode>411</Unicode> </TextEquiv> </TextLine> <TextEquiv> <Unicode>The National Archives' reference HO 40/2 Halifax 16th October 1812 15 Sir I beg leave to inform you that I have made every enquiry respecting the robberies, lately committed in this neighbourhood; and I am and I am sorry I have not been able to gain // any further information than what I have sent formerly. Tuesday, the 13th Inst. I have the honor to be Sir Your Most Obedient Servant Stirling Militia 411</Unicode> </TextEquiv> </TextRegion> <TextRegion orientation="0.0" id="r3" custom="readingOrder {index:2;}"> <Coords points="2037,383 2037,539 2049,539 2049,383"/> <TextEquiv>
We can parse this out as follows:
for t in tree.findall('Page/TextRegion/TextEquiv/Unicode'):
    print('...\n', t.text)
... 588 B ... The National Archives' reference HO 40/2 Halifax 16th October 1812 15 Sir I beg leave to inform you that I have made every enquiry respecting the robberies, lately committed in this neighbourhood; and I am and I am sorry I have not been able to gain // any further information than what I have sent formerly. Tuesday, the 13th Inst. I have the honor to be Sir Your Most Obedient Servant Stirling Militia 411 ... None ... 3 ... None ... 1 2 3 4 ... None ... COPYRIGHT PHOTOGRAPH - NOT TO BE
We can also pull out the co-ordinates defining lines identified within each text region.
Here's what the XML looks like:
!head -n 45 testDown/HO-40-2_13_14.pxml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?> <PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15 http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15/pagecontent.xsd"> <Metadata> <Creator>prov=University of Rostock/Institute of Mathematics/CITlab/Tobias Gruening/tobias.gruening@uni-rostock.de:name=/net_tf/LA73_249_0mod360.pb:de.uros.citlab.segmentation.CITlab_LA_ML:v=2.4.2 prov=University of Rostock/Institute of Mathematics/CITlab/Gundram Leifert/gundram.leifert@uni-rostock.de:name=English Writing M1(htr_id=133)::::v=2.4.3 prov=University of Rostock/Institute of Mathematics/CITlab/Tobias Gruening/tobias.gruening@uni-rostock.de:name=de.uros.citlab.module.baseline2polygon.B2PSeamMultiOriented:v=2.4.3 prov=University of Rostock/Institute of Mathematics/CITlab/Tobias Gruening/tobias.gruening@uni-rostock.de:name=/net_tf/LA73_249_0mod360.pb:de.uros.citlab.segmentation.CITlab_LA_ML:v=2.4.2 Transkribus</Creator> <Created>2019-11-25T19:08:42.512Z</Created> <LastChange>2019-11-25T20:54:26.883+01:00</LastChange> </Metadata> <Page imageFilename="HO-40-2_13_14.png" imageWidth="2480" imageHeight="1753"> <ReadingOrder> <OrderedGroup id="ro_1574711666913" caption="Regions reading order"> <RegionRefIndexed index="0" regionRef="r1"/> <RegionRefIndexed index="1" regionRef="r2"/> <RegionRefIndexed index="2" regionRef="r3"/> <RegionRefIndexed index="3" regionRef="r4"/> <RegionRefIndexed index="4" regionRef="r5"/> <RegionRefIndexed index="5" regionRef="r6"/> <RegionRefIndexed index="6" regionRef="r7"/> <RegionRefIndexed index="7" regionRef="r8"/> <RegionRefIndexed index="8" regionRef="r9"/> <RegionRefIndexed index="9" regionRef="r10"/> </OrderedGroup> </ReadingOrder> <TextRegion orientation="0.0" id="r1" custom="readingOrder {index:0;}"> <Coords points="130,1210 130,1364 202,1364 202,1210"/> <TextLine 
id="r1l1" custom="readingOrder {index:0;}"> <Coords points="130,1364 166,1362 202,1360 202,1210 166,1212 130,1214"/> <Baseline points="130,1314 166,1312 202,1310"/> <TextEquiv> <Unicode>588 B</Unicode> </TextEquiv> </TextLine> <TextEquiv> <Unicode>588 B</Unicode> </TextEquiv> </TextRegion> <TextRegion orientation="0.0" id="r2" custom="readingOrder {index:1;}"> <Coords points="1015,12 1015,1217 1991,1217 1991,12"/> <TextLine id="r2l1" custom="readingOrder {index:0;}"> <Coords points="1015,49 1040,49 1065,49 1090,49 1115,50 1140,50 1166,50 1191,50 1216,50 1241,50 1266,49 1291,49 1317,49 1342,49 1367,49 1392,50 1417,50 1442,50 1468,50 1468,13 1442,13 1417,13 1392,13 1367,12 1342,12 1317,12 1291,12 1266,12 1241,13 1216,13 1191,13 1166,13 1140,13 1115,13 1090,12 1065,12 1040,12 1015,12"/> <Baseline points="1015,37 1040,37 1065,37 1090,37 1115,38 1140,38 1166,38 1191,38 1216,38 1241,38 1266,37 1291,37 1317,37 1342,37 1367,37 1392,38 1417,38 1442,38 1468,38"/> <TextEquiv>
So let's get some co-ordinates...
for line in tree.findall('Page/TextRegion')[1].findall('TextLine')[3:6]:
    te = line.find('TextEquiv/Unicode')
    if te is not None:
        print(te.text, ' ::: ', line.find('Coords').attrib['points'], '\n')
Sir ::: 1225,463 1257,460 1289,461 1289,424 1257,423 1225,426 I beg leave to inform you that I have made ::: 1331,527 1363,526 1396,525 1429,524 1461,522 1494,520 1527,518 1559,516 1592,513 1625,511 1658,509 1690,507 1723,505 1756,503 1788,502 1821,500 1854,499 1886,498 1919,498 1952,498 1985,499 1985,462 1952,461 1919,461 1886,461 1854,462 1821,463 1788,465 1756,466 1723,468 1690,470 1658,472 1625,474 1592,476 1559,479 1527,481 1494,483 1461,485 1429,487 1396,488 1363,489 1331,490 every enquiry respecting the robberies, lately ::: 1230,574 1265,576 1300,578 1335,578 1371,579 1406,579 1441,578 1477,577 1512,576 1547,575 1583,574 1618,572 1653,571 1688,569 1724,568 1759,567 1794,566 1830,565 1865,565 1900,565 1936,566 1936,529 1900,528 1865,528 1830,528 1794,529 1759,530 1724,531 1688,532 1653,534 1618,535 1583,537 1547,538 1512,539 1477,540 1441,541 1406,542 1371,542 1335,541 1300,541 1265,539 1230,537
Using these co-ordinates, can we crop a region of the image corresponding to the text?
How about trying with OpenCV?
import numpy as np
import cv2
img = cv2.imread("testDOwn/HO-40-2_13_14.png")
Preview the image:
from matplotlib import pyplot as plt
plt.imshow(img)
plt.title('Sample Page')
plt.show()
Extract some co-ordinates:
_pts = tree.findall('Page/TextRegion')[1].findall('TextLine')[4].find('Coords').attrib['points']
pts = np.array([[int(_p.split(',')[0]), int(_p.split(',')[1])] for _p in _pts.split()])
pts[:3]
array([[1331, 527], [1363, 526], [1396, 525]])
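Before doing anything fancier, note that a simple axis-aligned crop only needs the min and max of the points array; a quick sketch using plain numpy slicing (no masking, so pixels from neighbouring lines inside the bounding box will bleed into the crop; boundingBoxCrop is just an illustrative name):

```python
import numpy as np

def boundingBoxCrop(img, pts):
    """Crop the axis-aligned bounding box of a set of polygon points.

    img is a numpy image array (rows index y, columns index x);
    pts is an (N, 2) array of (x, y) points."""
    x0, y0 = pts.min(axis=0)
    x1, y1 = pts.max(axis=0)
    return img[y0:y1, x0:x1]
```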
And then via a handy little script I found on Stack Overflow we can crop out the specified region:
import matplotlib as mpl
def getCroppedArea(img, pts, dpi=300):
    """Get a cropped area from an image.
    Via: https://stackoverflow.com/a/48301735/454773"""
    _dpi = mpl.rcParams['figure.dpi']
    mpl.rcParams['figure.dpi'] = dpi
    ## (1) Crop the bounding rect
    rect = cv2.boundingRect(pts)
    x, y, w, h = rect
    cropped = img[y:y+h, x:x+w].copy()
    ## (2) Make a mask from the polygon
    pts = pts - pts.min(axis=0)
    mask = np.zeros(cropped.shape[:2], np.uint8)
    cv2.drawContours(mask, [pts], -1, (255, 255, 255), -1, cv2.LINE_AA)
    ## (3) Apply the mask
    dst = cv2.bitwise_and(cropped, cropped, mask=mask)
    ## (4) Add the white background
    bg = np.ones_like(cropped, np.uint8) * 255
    cv2.bitwise_not(bg, bg, mask=mask)
    cropped_area = bg + dst
    #Restore the original figure dpi
    mpl.rcParams['figure.dpi'] = _dpi
    return cropped_area
Let's see if it works...
sentence = getCroppedArea(img, pts, 400)
plt.figure(figsize = (15,5))
plt.axis('off')
plt.imshow(sentence)
#Save the image
plt.savefig("test.png", dpi=400)
We can also embed the saved image, of course...
Image("test.png")
This sets up the possibility of creating a markdown file, for example, that embeds single line images as well as transcripts. Using Jupytext, such a document can be edited in a notebook editor.
#!pip3 install --upgrade tqdm
import os.path
import hashlib
import pathlib
from tqdm.notebook import tqdm
def markdownFromRegion(region, imgfile, fn='testout.md', outdir='test'):
    """Generate markdown page for Transkribus text region."""
    #Make sure the path to the outdir exists...
    pathlib.Path(outdir).mkdir(parents=True, exist_ok=True)
    md = ''
    img = cv2.imread(imgfile)
    #Provide a progress bar using tqdm
    for line in tqdm(region.findall('TextLine')):
        _pts = line.find('Coords').attrib['points']
        pts = np.array([[int(_p.split(',')[0]), int(_p.split(',')[1])] for _p in _pts.split()])
        sentence = getCroppedArea(img, pts, 400)
        #Save the image
        _img_uid = hashlib.md5(str(pts).encode()).hexdigest()
        ifn = f"img_{_img_uid}.png"
        cv2.imwrite(os.path.join(outdir, ifn), sentence)
        _txt = line.find('TextEquiv/Unicode')
        _txt = _txt.text if _txt is not None else ''
        #Embed the line image followed by its transcript
        md = f'{md}\n\n![]({ifn})\n\n{_txt}\n\n'
    with open(os.path.join(outdir, fn), 'w') as f:
        f.write(md)
Running it seems to be pretty quick...
tree = transkribus_tree('testDown/HO-40-2_13_14.pxml')
markdownFromRegion(tree.findall('Page/TextRegion')[1], "testDOwn/HO-40-2_13_14.png")
Here's an example of what the md raw, and rendered, looks like:
Image("transkribus_testout_md.png")
Let's see if we can create a collection containing a couple of one or two page PDFs, segment the text lines and retrieve the XML and JPG, then segment the JPG into a markdown document.
# Create a collection
#Get one or two pages out of a PDF as a PDF
# Submit the doc
# Segment the Lines
# Get the XML and JPG Back
# Crop each line, saving image fragments
# Insert image link to each fragment into md along with transcription text