Document AI | Inference at line level with a Document Understanding model (LayoutXLM base fine-tuned on DocLayNet dataset)¶

Credit:
- notebook created from the notebook Fine_tuning_LayoutXLM_on_XFUND_for_token_classification_using_HuggingFace_Trainer.ipynb
- dataset from IBM Research (DocLayNet)
Author of this notebook: Pierre GUILLOU
Date: 03/04/2023
Blog posts:
- Layout XLM base
  - (03/05/2023) Document AI | Inference APP and fine-tuning notebook for Document Understanding at line level with LayoutXLM base
- LiLT base
Notebooks (paragraph level)
- LiLT base
Notebooks (line level)
- Layout XLM base
- LiLT base

Inference at line level¶

LayoutXLM was proposed in LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding by Yiheng Xu, Tengchao Lv, Lei Cui, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Furu Wei.

It is a Document Understanding model that uses both layout and text in order to detect labels of bounding boxes. More, it’s a multilingual extension of the LayoutLMv2 model trained on 53 languages.

It relies on an external OCR engine to get words and bboxes from the document image. Thus, let's run in this APP an OCR engine ourselves (PyTesseract) as we'll need to do it in real life to get the bounding boxes, then run LayoutXLM base (already fine-tuned on the DocLayNet dataset at line level: pierreguillou/lilt-xlm-roberta-base-finetuned-with-DocLayNet-base-at-linelevel-ml384) on the individual tokens and visualize the result at line level!

Install Detectron 2, PDF2image, LangDetect & PyTesseract OCR¶

Detectron 2¶

In [ ]:

!pip install -q torch==1.10.0+cu111 torchvision==0.11+cu111 -f https://download.pytorch.org/whl/torch_stable.html
!python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'

     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.1/2.1 GB 778.7 kB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 21.8/21.8 MB 33.6 MB/s eta 0:00:00
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchtext 0.14.1 requires torch==1.13.1, but you have torch 1.10.0+cu111 which is incompatible.
torchaudio 0.13.1+cu116 requires torch==1.13.1, but you have torch 1.10.0+cu111 which is incompatible.
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/facebookresearch/detectron2.git
  Cloning https://github.com/facebookresearch/detectron2.git to /tmp/pip-req-build-lzm6opuz
  Running command git clone --filter=blob:none --quiet https://github.com/facebookresearch/detectron2.git /tmp/pip-req-build-lzm6opuz
  Resolved https://github.com/facebookresearch/detectron2.git to commit 5fcee61ce3a25050fc19e73714fdd7bafbefdbbc
  Preparing metadata (setup.py) ... done
Requirement already satisfied: Pillow>=7.1 in /usr/local/lib/python3.9/dist-packages (from detectron2==0.6) (8.4.0)
Requirement already satisfied: matplotlib in /usr/local/lib/python3.9/dist-packages (from detectron2==0.6) (3.5.3)
Requirement already satisfied: pycocotools>=2.0.2 in /usr/local/lib/python3.9/dist-packages (from detectron2==0.6) (2.0.6)
Requirement already satisfied: termcolor>=1.1 in /usr/local/lib/python3.9/dist-packages (from detectron2==0.6) (2.2.0)
Collecting yacs>=0.1.8
  Downloading yacs-0.1.8-py3-none-any.whl (14 kB)
Requirement already satisfied: tabulate in /usr/local/lib/python3.9/dist-packages (from detectron2==0.6) (0.8.10)
Requirement already satisfied: cloudpickle in /usr/local/lib/python3.9/dist-packages (from detectron2==0.6) (2.2.1)
Requirement already satisfied: tqdm>4.29.0 in /usr/local/lib/python3.9/dist-packages (from detectron2==0.6) (4.65.0)
Requirement already satisfied: tensorboard in /usr/local/lib/python3.9/dist-packages (from detectron2==0.6) (2.11.2)
Collecting fvcore<0.1.6,>=0.1.5
  Downloading fvcore-0.1.5.post20221221.tar.gz (50 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 50.2/50.2 KB 5.3 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting iopath<0.1.10,>=0.1.7
  Downloading iopath-0.1.9-py3-none-any.whl (27 kB)
Collecting omegaconf>=2.1
  Downloading omegaconf-2.3.0-py3-none-any.whl (79 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 79.5/79.5 KB 7.2 MB/s eta 0:00:00
Collecting hydra-core>=1.1
  Downloading hydra_core-1.3.2-py3-none-any.whl (154 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 154.5/154.5 KB 15.8 MB/s eta 0:00:00
Collecting black
  Downloading black-23.1.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.6/1.6 MB 47.0 MB/s eta 0:00:00
Requirement already satisfied: packaging in /usr/local/lib/python3.9/dist-packages (from detectron2==0.6) (23.0)
Requirement already satisfied: numpy in /usr/local/lib/python3.9/dist-packages (from fvcore<0.1.6,>=0.1.5->detectron2==0.6) (1.22.4)
Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.9/dist-packages (from fvcore<0.1.6,>=0.1.5->detectron2==0.6) (6.0)
Collecting antlr4-python3-runtime==4.9.*
  Downloading antlr4-python3-runtime-4.9.3.tar.gz (117 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 117.0/117.0 KB 14.9 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting portalocker
  Downloading portalocker-2.7.0-py2.py3-none-any.whl (15 kB)
Requirement already satisfied: fonttools>=4.22.0 in /usr/local/lib/python3.9/dist-packages (from matplotlib->detectron2==0.6) (4.39.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.9/dist-packages (from matplotlib->detectron2==0.6) (1.4.4)
Requirement already satisfied: pyparsing>=2.2.1 in /usr/local/lib/python3.9/dist-packages (from matplotlib->detectron2==0.6) (3.0.9)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.9/dist-packages (from matplotlib->detectron2==0.6) (0.11.0)
Requirement already satisfied: python-dateutil>=2.7 in /usr/local/lib/python3.9/dist-packages (from matplotlib->detectron2==0.6) (2.8.2)
Requirement already satisfied: platformdirs>=2 in /usr/local/lib/python3.9/dist-packages (from black->detectron2==0.6) (3.1.0)
Collecting pathspec>=0.9.0
  Downloading pathspec-0.11.0-py3-none-any.whl (29 kB)
Collecting mypy-extensions>=0.4.3
  Downloading mypy_extensions-1.0.0-py3-none-any.whl (4.7 kB)
Requirement already satisfied: tomli>=1.1.0 in /usr/local/lib/python3.9/dist-packages (from black->detectron2==0.6) (2.0.1)
Requirement already satisfied: click>=8.0.0 in /usr/local/lib/python3.9/dist-packages (from black->detectron2==0.6) (8.1.3)
Requirement already satisfied: typing-extensions>=3.10.0.0 in /usr/local/lib/python3.9/dist-packages (from black->detectron2==0.6) (4.5.0)
Requirement already satisfied: grpcio>=1.24.3 in /usr/local/lib/python3.9/dist-packages (from tensorboard->detectron2==0.6) (1.51.3)
Requirement already satisfied: requests<3,>=2.21.0 in /usr/local/lib/python3.9/dist-packages (from tensorboard->detectron2==0.6) (2.25.1)
Requirement already satisfied: werkzeug>=1.0.1 in /usr/local/lib/python3.9/dist-packages (from tensorboard->detectron2==0.6) (2.2.3)
Requirement already satisfied: tensorboard-plugin-wit>=1.6.0 in /usr/local/lib/python3.9/dist-packages (from tensorboard->detectron2==0.6) (1.8.1)
Requirement already satisfied: markdown>=2.6.8 in /usr/local/lib/python3.9/dist-packages (from tensorboard->detectron2==0.6) (3.4.1)
Requirement already satisfied: google-auth-oauthlib<0.5,>=0.4.1 in /usr/local/lib/python3.9/dist-packages (from tensorboard->detectron2==0.6) (0.4.6)
Requirement already satisfied: tensorboard-data-server<0.7.0,>=0.6.0 in /usr/local/lib/python3.9/dist-packages (from tensorboard->detectron2==0.6) (0.6.1)
Requirement already satisfied: wheel>=0.26 in /usr/local/lib/python3.9/dist-packages (from tensorboard->detectron2==0.6) (0.38.4)
Requirement already satisfied: setuptools>=41.0.0 in /usr/local/lib/python3.9/dist-packages (from tensorboard->detectron2==0.6) (57.4.0)
Requirement already satisfied: absl-py>=0.4 in /usr/local/lib/python3.9/dist-packages (from tensorboard->detectron2==0.6) (1.4.0)
Requirement already satisfied: protobuf<4,>=3.9.2 in /usr/local/lib/python3.9/dist-packages (from tensorboard->detectron2==0.6) (3.19.6)
Requirement already satisfied: google-auth<3,>=1.6.3 in /usr/local/lib/python3.9/dist-packages (from tensorboard->detectron2==0.6) (2.16.2)
Requirement already satisfied: six>=1.9.0 in /usr/local/lib/python3.9/dist-packages (from google-auth<3,>=1.6.3->tensorboard->detectron2==0.6) (1.15.0)
Requirement already satisfied: pyasn1-modules>=0.2.1 in /usr/local/lib/python3.9/dist-packages (from google-auth<3,>=1.6.3->tensorboard->detectron2==0.6) (0.2.8)
Requirement already satisfied: rsa<5,>=3.1.4 in /usr/local/lib/python3.9/dist-packages (from google-auth<3,>=1.6.3->tensorboard->detectron2==0.6) (4.9)
Requirement already satisfied: cachetools<6.0,>=2.0.0 in /usr/local/lib/python3.9/dist-packages (from google-auth<3,>=1.6.3->tensorboard->detectron2==0.6) (5.3.0)
Requirement already satisfied: requests-oauthlib>=0.7.0 in /usr/local/lib/python3.9/dist-packages (from google-auth-oauthlib<0.5,>=0.4.1->tensorboard->detectron2==0.6) (1.3.1)
Requirement already satisfied: importlib-metadata>=4.4 in /usr/local/lib/python3.9/dist-packages (from markdown>=2.6.8->tensorboard->detectron2==0.6) (6.0.0)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.9/dist-packages (from requests<3,>=2.21.0->tensorboard->detectron2==0.6) (2022.12.7)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.9/dist-packages (from requests<3,>=2.21.0->tensorboard->detectron2==0.6) (1.26.14)
Requirement already satisfied: chardet<5,>=3.0.2 in /usr/local/lib/python3.9/dist-packages (from requests<3,>=2.21.0->tensorboard->detectron2==0.6) (4.0.0)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.9/dist-packages (from requests<3,>=2.21.0->tensorboard->detectron2==0.6) (2.10)
Requirement already satisfied: MarkupSafe>=2.1.1 in /usr/local/lib/python3.9/dist-packages (from werkzeug>=1.0.1->tensorboard->detectron2==0.6) (2.1.2)
Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.9/dist-packages (from importlib-metadata>=4.4->markdown>=2.6.8->tensorboard->detectron2==0.6) (3.15.0)
Requirement already satisfied: pyasn1<0.5.0,>=0.4.6 in /usr/local/lib/python3.9/dist-packages (from pyasn1-modules>=0.2.1->google-auth<3,>=1.6.3->tensorboard->detectron2==0.6) (0.4.8)
Requirement already satisfied: oauthlib>=3.0.0 in /usr/local/lib/python3.9/dist-packages (from requests-oauthlib>=0.7.0->google-auth-oauthlib<0.5,>=0.4.1->tensorboard->detectron2==0.6) (3.2.2)
Building wheels for collected packages: detectron2, fvcore, antlr4-python3-runtime
  Building wheel for detectron2 (setup.py) ... done
  Created wheel for detectron2: filename=detectron2-0.6-cp39-cp39-linux_x86_64.whl size=6192087 sha256=28fd41c532b394eee4055961181ca491e142d68d758b127693ac4c28a6b20c7a
  Stored in directory: /tmp/pip-ephem-wheel-cache-ks0z7o3j/wheels/59/b4/83/84bfca751fa4dcc59998468be8688eb50e97408a83af171d42
  Building wheel for fvcore (setup.py) ... done
  Created wheel for fvcore: filename=fvcore-0.1.5.post20221221-py3-none-any.whl size=61431 sha256=e8a9ec738c37a3f5126d0b8250e4c11b28b6a220c357baaaa782ce1f2c72d2bf
  Stored in directory: /root/.cache/pip/wheels/83/42/02/66178d16e5c44dc26d309931834956baeda371956e86fbd876
  Building wheel for antlr4-python3-runtime (setup.py) ... done
  Created wheel for antlr4-python3-runtime: filename=antlr4_python3_runtime-4.9.3-py3-none-any.whl size=144575 sha256=6c67831dac6e53fb9e689c5a064db022016ed7d0c30172861b7511ce439e81e2
  Stored in directory: /root/.cache/pip/wheels/23/cf/80/f3efa822e6ab23277902ee9165fe772eeb1dfb8014f359020a
Successfully built detectron2 fvcore antlr4-python3-runtime
Installing collected packages: antlr4-python3-runtime, yacs, portalocker, pathspec, omegaconf, mypy-extensions, iopath, hydra-core, black, fvcore, detectron2
Successfully installed antlr4-python3-runtime-4.9.3 black-23.1.0 detectron2-0.6 fvcore-0.1.5.post20221221 hydra-core-1.3.2 iopath-0.1.9 mypy-extensions-1.0.0 omegaconf-2.3.0 pathspec-0.11.0 portalocker-2.7.0 yacs-0.1.8

PDF2image¶

In [ ]:

%%capture
!apt-get install poppler-utils
!pip install pdf2image

LangDetect¶

In [ ]:

!pip install -q langdetect

     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 981.5/981.5 KB 23.2 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
  Building wheel for langdetect (setup.py) ... done

Tesseract OCR¶

In [ ]:

%%capture
!sudo apt install tesseract-ocr-all # english + osd (Orientation and script detection module)
# !sudo apt-get install tesseract-ocr-por # portuguese

import os
print(os.popen(f'cat /etc/debian_version').read())
print(os.popen(f'cat /etc/issue').read())
print(os.popen(f'apt search tesseract').read())

!pip install pytesseract

Kill runtime¶

In Colab, it is needed in order to update libraries with their new installed version (pillow).

In [ ]:

import os
os.kill(os.getpid(), 9)

Set-up environment¶

Libraries¶

In [ ]:

!pip install -q transformers sentencepiece datasets gradio pypdf

     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.3/6.3 MB 45.7 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.3/1.3 MB 59.3 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 469.0/469.0 KB 36.9 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14.3/14.3 MB 44.3 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 244.3/244.3 KB 26.0 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 199.1/199.1 KB 21.0 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.6/7.6 MB 66.1 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 110.5/110.5 KB 7.1 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 212.2/212.2 KB 19.6 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 132.9/132.9 KB 13.7 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.0/1.0 MB 43.9 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 106.5/106.5 KB 12.0 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 140.9/140.9 KB 15.0 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56.9/56.9 KB 6.1 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 71.5/71.5 KB 7.0 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 84.5/84.5 KB 7.1 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 50.5/50.5 KB 4.9 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.1/2.1 MB 53.0 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 45.7/45.7 KB 4.3 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56.3/56.3 KB 4.6 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 114.2/114.2 KB 11.9 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 158.8/158.8 KB 15.9 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 264.6/264.6 KB 24.6 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 199.2/199.2 KB 20.8 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 66.4/66.4 KB 6.8 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 69.6/69.6 KB 5.9 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 58.3/58.3 KB 5.1 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 80.6/80.6 KB 7.2 MB/s eta 0:00:00
  Building wheel for ffmpy (setup.py) ... done

In [ ]:

import os 
from operator import itemgetter
import collections
import string
import re

import pypdf
from pypdf import PdfReader
from pypdf.errors import PdfReadError

import pdf2image
from pdf2image import convert_from_path
import langdetect
from langdetect import detect_langs
import pytesseract

import pandas as pd
import numpy as np
import random
from google.colab import files

import tempfile
from matplotlib import font_manager
from PIL import Image, ImageDraw, ImageFont
import cv2
# In Colab, use cv2_imshow instead of cv2.imshow
from google.colab.patches import cv2_imshow
from IPython.display import display

import itertools
import gradio as gr

import pathlib
from pathlib import Path
import shutil

Key parameters¶

In [ ]:

# categories colors
label2color = {
    'Caption': 'brown',
    'Footnote': 'orange',
    'Formula': 'gray',
    'List-item': 'yellow',
    'Page-footer': 'red',
    'Page-header': 'red',
    'Picture': 'violet',
    'Section-header': 'orange',
    'Table': 'green',
    'Text': 'blue',
    'Title': 'pink'
    }

# bounding boxes start and end of a sequence
cls_box = [0, 0, 0, 0]
sep_box = [1000, 1000, 1000, 1000]

# model
model_id = "pierreguillou/layout-xlm-base-finetuned-with-DocLayNet-base-at-linelevel-ml384"

# tokenizer
tokenizer_id = "xlm-roberta-base"

# (tokenization) The maximum length of a feature (sequence)
if str(384) in model_id:
  max_length = 384 
elif str(512) in model_id:
  max_length = 512 
else:
  print("Error with max_length of chunks!")

# (tokenization) overlap
doc_stride = 128 # The authorized overlap between two part of the context when splitting it is needed.

# max PDF page images that will be displayed
max_imgboxes = 2

# get files
examples_dir = 'files/'
Path(examples_dir).mkdir(parents=True, exist_ok=True)
from huggingface_hub import hf_hub_download
files = ["example.pdf", "blank.pdf", "blank.png", "languages_iso.csv", "languages_tesseract.csv", "wo_content.png"]
for file_name in files:
    path_to_file = hf_hub_download(
        repo_id = "pierreguillou/Inference-APP-Document-Understanding-at-linelevel-v2",
        filename = "files/" + file_name,
        repo_type = "space"
        )
    shutil.copy(path_to_file,examples_dir)

# path to files
image_wo_content = examples_dir + "wo_content.png" # image without content
pdf_blank = examples_dir + "blank.pdf" # blank PDF
image_blank = examples_dir + "blank.png" # blank image

## get langdetect2Tesseract dictionary
t = "files/languages_tesseract.csv"
l = "files/languages_iso.csv"

df_t = pd.read_csv(t)
df_l = pd.read_csv(l)

langs_t = df_t["Language"].to_list()
langs_t = [lang_t.lower().strip().translate(str.maketrans('', '', string.punctuation)) for lang_t in langs_t]
langs_l = df_l["Language"].to_list()
langs_l = [lang_l.lower().strip().translate(str.maketrans('', '', string.punctuation)) for lang_l in langs_l]
langscode_t = df_t["LangCode"].to_list()
langscode_l = df_l["LangCode"].to_list()

Tesseract2langdetect, langdetect2Tesseract = dict(), dict()
for lang_t, langcode_t in zip(langs_t,langscode_t):
  try:
    if lang_t == "Chinese - Simplified".lower().strip().translate(str.maketrans('', '', string.punctuation)): lang_t = "chinese"
    index = langs_l.index(lang_t)
    langcode_l = langscode_l[index]
    Tesseract2langdetect[langcode_t] = langcode_l
  except: 
    continue

langdetect2Tesseract = {v:k for k,v in Tesseract2langdetect.items()}

Downloading (…)in/files/example.pdf:   0%|          | 0.00/343k [00:00<?, ?B/s]

Downloading (…)main/files/blank.pdf:   0%|          | 0.00/1.15k [00:00<?, ?B/s]

Downloading (…)main/files/blank.png:   0%|          | 0.00/3.99k [00:00<?, ?B/s]

Downloading (…)es/languages_iso.csv:   0%|          | 0.00/2.52k [00:00<?, ?B/s]

Downloading (…)guages_tesseract.csv:   0%|          | 0.00/2.13k [00:00<?, ?B/s]

Downloading (…)files/wo_content.png:   0%|          | 0.00/27.6k [00:00<?, ?B/s]

Functions¶

General¶

In [ ]:

# get text and bounding boxes from an image
# https://stackoverflow.com/questions/61347755/how-can-i-get-line-coordinates-that-readed-by-tesseract
# https://medium.com/geekculture/tesseract-ocr-understanding-the-contents-of-documents-beyond-their-text-a98704b7c655
def get_data(results, factor, conf_min=0):

  data = {}
  for i in range(len(results['line_num'])):
    level = results['level'][i]
    block_num = results['block_num'][i]
    par_num = results['par_num'][i]
    line_num = results['line_num'][i]
    top, left = results['top'][i], results['left'][i]
    width, height = results['width'][i], results['height'][i]
    conf = results['conf'][i]
    text = results['text'][i]
    if not (text == '' or text.isspace()):
      if conf >= conf_min:
        tup = (text, left, top, width, height)
        if block_num in list(data.keys()):
          if par_num in list(data[block_num].keys()):
            if line_num in list(data[block_num][par_num].keys()):
              data[block_num][par_num][line_num].append(tup)
            else:
              data[block_num][par_num][line_num] = [tup]
          else:
            data[block_num][par_num] = {}
            data[block_num][par_num][line_num] = [tup]
        else:
            data[block_num] = {}
            data[block_num][par_num] = {}
            data[block_num][par_num][line_num] = [tup]

  # get paragraphs dicionnary with list of lines
  par_data = {}
  par_idx = 1
  for _, b  in data.items():
    for _, p in b.items():
      line_data = {}
      line_idx = 1
      for _, l in p.items():
        line_data[line_idx] = l
        line_idx += 1
      par_data[par_idx] = line_data 
      par_idx += 1

  # get lines of texts, grouped by paragraph
  lines = list()
  row_indexes = list()
  row_index = 0
  for _,par in par_data.items():
    count_lines = 0
    for _,line in par.items():
      if count_lines == 0: row_indexes.append(row_index)
      line_text = ' '.join([item[0] for item in line])
      lines.append(line_text)
      count_lines += 1
      row_index += 1
    # lines.append("\n")
    row_index += 1
  # lines = lines[:-1]
  
  # get paragraphes boxes (par_boxes)
  # get lines boxes (line_boxes)
  par_boxes = list()
  par_idx = 1
  line_boxes = list()
  line_idx = 1
  for _, par in par_data.items():
    xmins, ymins, xmaxs, ymaxs = list(), list(), list(), list()
    for _, line in par.items():
      xmin, ymin = line[0][1], line[0][2]
      xmax, ymax = (line[-1][1] + line[-1][3]), (line[-1][2] + line[-1][4])
      line_boxes.append([int(xmin/factor), int(ymin/factor), int(xmax/factor), int(ymax/factor)])
      xmins.append(xmin)
      ymins.append(ymin)
      xmaxs.append(xmax)
      ymaxs.append(ymax)
      line_idx += 1
    xmin, ymin, xmax, ymax = min(xmins), min(ymins), max(xmaxs), max(ymaxs)
    par_boxes.append([int(xmin/factor), int(ymin/factor), int(xmax/factor), int(ymax/factor)])
    par_idx += 1

  return lines, row_indexes, par_boxes, line_boxes #data, par_data #

# rescale image to get 300dpi
def set_image_dpi_resize(image):
    """
    Rescaling image to 300dpi while resizing
    :param image: An image
    :return: A rescaled image
    """
    length_x, width_y = image.size
    factor = min(1, float(1024.0 / length_x))
    size = int(factor * length_x), int(factor * width_y)
    # image_resize = image.resize(size, Image.Resampling.LANCZOS)
    image_resize = image.resize(size, Image.LANCZOS)
    temp_file = tempfile.NamedTemporaryFile(delete=False, suffix='1.png')
    temp_filename = temp_file.name
    image_resize.save(temp_filename, dpi=(300, 300))
    return factor, temp_filename

In [ ]:

# it is important that each bounding box should be in (upper left, lower right) format.
# source: https://github.com/NielsRogge/Transformers-Tutorials/issues/129
def upperleft_to_lowerright(bbox):
  x0, y0, x1, y1 = tuple(bbox)
  if bbox[2] < bbox[0]:
    x0 = bbox[2]
    x1 = bbox[0] 
  if bbox[3] < bbox[1]:
    y0 = bbox[3]
    y1 = bbox[1] 
  return [x0, y0, x1, y1]

# convert boundings boxes (left, top, width, height) format to (left, top, left+widght, top+height) format. 
def convert_box(bbox):
    x, y, w, h = tuple(bbox) # the row comes in (left, top, width, height) format
    return [x, y, x+w, y+h] # we turn it into (left, top, left+widght, top+height) to get the actual box 

# LiLT model gets 1000x10000 pixels images
def normalize_box(bbox, width, height):
    return [
        int(1000 * (bbox[0] / width)),
        int(1000 * (bbox[1] / height)),
        int(1000 * (bbox[2] / width)),
        int(1000 * (bbox[3] / height)),
    ]

# LiLT model gets 1000x10000 pixels images
def denormalize_box(bbox, width, height):
    return [
        int(width * (bbox[0] / 1000)),
        int(height * (bbox[1] / 1000)),
        int(width* (bbox[2] / 1000)),
        int(height * (bbox[3] / 1000)),
    ]

# get back original size
def original_box(box, original_width, original_height, coco_width, coco_height):
    return [
        int(original_width * (box[0] / coco_width)),
        int(original_height * (box[1] / coco_height)),
        int(original_width * (box[2] / coco_width)),
        int(original_height* (box[3] / coco_height)),
    ]

def get_blocks(bboxes_block, categories, texts):

 # get list of unique block boxes
    bbox_block_dict, bboxes_block_list, bbox_block_prec = dict(), list(), list()
    for count_block, bbox_block in enumerate(bboxes_block):
      if bbox_block != bbox_block_prec:
        bbox_block_indexes = [i for i, bbox in enumerate(bboxes_block) if bbox == bbox_block]
        bbox_block_dict[count_block] = bbox_block_indexes
        bboxes_block_list.append(bbox_block)
      bbox_block_prec = bbox_block

    # get list of categories and texts by unique block boxes
    category_block_list, text_block_list = list(), list()
    for bbox_block in bboxes_block_list:
      count_block = bboxes_block.index(bbox_block)
      bbox_block_indexes = bbox_block_dict[count_block]
      category_block = np.array(categories, dtype=object)[bbox_block_indexes].tolist()[0]
      category_block_list.append(category_block)
      text_block = np.array(texts, dtype=object)[bbox_block_indexes].tolist()
      text_block = [text.replace("\n","").strip() for text in text_block]
      if id2label[category_block] == "Text" or id2label[category_block] == "Caption" or id2label[category_block] == "Footnote":
        text_block = ' '.join(text_block)
      else:
        text_block = '\n'.join(text_block)
      text_block_list.append(text_block)

    return bboxes_block_list, category_block_list, text_block_list

# function to sort bounding boxes
def get_sorted_boxes(bboxes):

  # sort by y from page top to bottom 
  sorted_bboxes = sorted(bboxes, key=itemgetter(1), reverse=False)
  y_list = [bbox[1] for bbox in sorted_bboxes]

  # sort by x from page left to right when boxes with same y
  if len(list(set(y_list))) != len(y_list):
    y_list_duplicates_indexes = dict()
    y_list_duplicates = [item for item, count in collections.Counter(y_list).items() if count > 1]
    for item in y_list_duplicates:
      y_list_duplicates_indexes[item] = [i for i, e in enumerate(y_list) if e == item]
      bbox_list_y_duplicates = sorted(np.array(sorted_bboxes, dtype=object)[y_list_duplicates_indexes[item]].tolist(), key=itemgetter(0), reverse=False)
      np_array_bboxes = np.array(sorted_bboxes)
      np_array_bboxes[y_list_duplicates_indexes[item]] = np.array(bbox_list_y_duplicates)
      sorted_bboxes = np_array_bboxes.tolist()

  return sorted_bboxes

# sort data from y = 0 to end of page (and after, x=0 to end of page when necessary)
def sort_data(bboxes, categories, texts):

    sorted_bboxes = get_sorted_boxes(bboxes)
    sorted_bboxes_indexes = [bboxes.index(bbox) for bbox in sorted_bboxes]
    sorted_categories = np.array(categories, dtype=object)[sorted_bboxes_indexes].tolist()
    sorted_texts = np.array(texts, dtype=object)[sorted_bboxes_indexes].tolist()

    return sorted_bboxes, sorted_categories, sorted_texts

# sort data from y = 0 to end of page (and after, x=0 to end of page when necessary)
def sort_data_wo_labels(bboxes, texts):

    sorted_bboxes = get_sorted_boxes(bboxes)
    sorted_bboxes_indexes = [bboxes.index(bbox) for bbox in sorted_bboxes]
    sorted_texts = np.array(texts, dtype=object)[sorted_bboxes_indexes].tolist()

    return sorted_bboxes, sorted_texts

PDF processing¶

In [ ]:

# get filename and images of PDF pages
def pdf_to_images(uploaded_pdf):

    # file name of the uploaded PDF
    filename = next(iter(uploaded_pdf))

    try:
        PdfReader(filename)
    except PdfReadError:
        print("Invalid PDF file.")
    else:
        try:
            images = convert_from_path(str(filename))
            num_imgs = len(images)
            print(f'The PDF "{filename}"" was converted into {num_imgs} images.')
            print("Now, you can extract data from theses images (text, bounding boxes...).")
        except:
            print(f"Error with the PDF {filename}:it was not converted into images.")
            print()
        else:
            # display images
            if num_imgs > 0:

                import matplotlib.pyplot as plt
                %matplotlib inline

                plt.figure(figsize=(20,10))
                columns = 5
                for i, image in enumerate(images):
                    plt.subplot(int(num_imgs / columns + 1), columns, i + 1)
                    plt.xticks(color="white")
                    plt.yticks(color="white")
                    plt.tick_params(bottom = False)
                    plt.tick_params(left = False)
                    plt.imshow(image)

                return filename, images

In [ ]:

# Extraction of image data (text and bounding boxes)
def extraction_data_from_image(images):

    num_imgs = len(images)

    if num_imgs > 0:

        # https://pyimagesearch.com/2021/11/15/tesseract-page-segmentation-modes-psms-explained-how-to-improve-your-ocr-accuracy/
        custom_config = r'--oem 3 --psm 3 -l eng' # default config PyTesseract: --oem 3 --psm 3 -l eng+deu+fra+jpn+por+spa+rus+hin+chi_sim
        results, lines, row_indexes, par_boxes, line_boxes, images_pixels = dict(), dict(), dict(), dict(), dict(), dict()
        images_ids_list, lines_list, par_boxes_list, line_boxes_list, images_list, images_pixels_list, page_no_list, num_pages_list = list(), list(), list(), list(), list(), list(), list(), list()
        
        try: 
            for i,image in enumerate(images):
                # image preprocessing
                # https://docs.opencv.org/3.0-beta/doc/py_tutorials/py_imgproc/py_thresholding/py_thresholding.html
                img = image.copy()
                factor, path_to_img = set_image_dpi_resize(img) # Rescaling to 300dpi
                img = Image.open(path_to_img)
                img = np.array(img, dtype='uint8') # convert PIL to cv2
                img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY) # gray scale image
                ret,img = cv2.threshold(img,127,255,cv2.THRESH_BINARY)

                # OCR PyTesseract | get langs of page
                txt = pytesseract.image_to_string(img, config=custom_config)
                txt = txt.strip().lower()
                txt = re.sub(r" +", " ", txt) # multiple space
                txt = re.sub(r"(\n\s*)+\n+", "\n", txt) # multiple line
                # txt = os.popen(f'tesseract {img_filepath} - {custom_config}').read()
                try:
                    langs = detect_langs(txt)
                    langs = [langdetect2Tesseract[langs[i].lang] for i in range(len(langs))]
                    langs_string = '+'.join(langs)
                except:
                    langs_string = "eng"
                langs_string += '+osd'
                custom_config = f'--oem 3 --psm 3 -l {langs_string}' # default config PyTesseract: --oem 3 --psm 3

                # OCR PyTesseract | get data
                results[i] = pytesseract.image_to_data(img, config=custom_config, output_type=pytesseract.Output.DICT)
                # results[i] = os.popen(f'tesseract {img_filepath} - {custom_config}').read()

                # get image pixels
                images_pixels[i] = feature_extractor(images[i], return_tensors="pt").pixel_values

                lines[i], row_indexes[i], par_boxes[i], line_boxes[i] = get_data(results[i], factor, conf_min=0)
                lines_list.append(lines[i])
                par_boxes_list.append(par_boxes[i])
                line_boxes_list.append(line_boxes[i])
                images_ids_list.append(i)
                images_pixels_list.append(images_pixels[i])
                images_list.append(images[i])
                page_no_list.append(i)
                num_pages_list.append(num_imgs)

        except:
            print(f"There was an error within the extraction of PDF text by the OCR!")
        else: 
            from datasets import Dataset
            dataset = Dataset.from_dict({"images_ids": images_ids_list, "images": images_list, "images_pixels": images_pixels_list, "page_no": page_no_list, "num_pages": num_pages_list, "texts": lines_list, "bboxes_line": line_boxes_list})
                    
            # print(f"The text data was successfully extracted by the OCR!")

            return dataset, lines, row_indexes, par_boxes, line_boxes

Inference¶

In [ ]:

def prepare_inference_features(example, cls_box = cls_box, sep_box = sep_box):

  images_ids_list, chunks_ids_list, input_ids_list, attention_mask_list, bb_list, images_pixels_list = list(), list(), list(), list(), list(), list()

  # get batch 
  batch_images_ids = example["images_ids"]
  batch_images = example["images"]
  batch_images_pixels = example["images_pixels"]
  batch_bboxes_line = example["bboxes_line"]
  batch_texts = example["texts"]  
  batch_images_size = [image.size for image in batch_images]

  batch_width, batch_height = [image_size[0] for image_size in batch_images_size], [image_size[1] for image_size in batch_images_size]

  # add a dimension if not a batch but only one image
  if not isinstance(batch_images_ids, list): 
    batch_images_ids = [batch_images_ids]
    batch_images = [batch_images]
    batch_images_pixels = [batch_images_pixels]
    batch_bboxes_line = [batch_bboxes_line]
    batch_texts = [batch_texts]
    batch_width, batch_height = [batch_width], [batch_height] 

  # process all images of the batch
  for num_batch, (image_id, image_pixels, boxes, texts, width, height) in enumerate(zip(batch_images_ids, batch_images_pixels, batch_bboxes_line, batch_texts, batch_width, batch_height)):
    tokens_list = []
    bboxes_list = []

    # add a dimension if only on image
    if not isinstance(texts, list):
      texts, boxes = [texts], [boxes]

    # convert boxes to original
    normalize_bboxes_line = [normalize_box(upperleft_to_lowerright(box), width, height) for box in boxes]

    # sort boxes with texts
    # we want sorted lists from top to bottom of the image
    boxes, texts = sort_data_wo_labels(normalize_bboxes_line, texts)

    count = 0
    for box, text in zip(boxes, texts):
      tokens = tokenizer.tokenize(text)
      num_tokens = len(tokens) # get number of tokens
      tokens_list.extend(tokens)

      bboxes_list.extend([box] * num_tokens) # number of boxes must be the same as the number of tokens

    # use of return_overflowing_tokens=True / stride=doc_stride
    # to get parts of image with overlap
    # source: https://huggingface.co/course/chapter6/3b?fw=tf#handling-long-contexts
    encodings = tokenizer(" ".join(texts), 
                          truncation=True,
                          padding="max_length", 
                          max_length=max_length, 
                          stride=doc_stride, 
                          return_overflowing_tokens=True, 
                          return_offsets_mapping=True
                          )

    otsm = encodings.pop("overflow_to_sample_mapping")
    offset_mapping = encodings.pop("offset_mapping")

    # Let's label those examples and get their boxes   
    sequence_length_prev = 0   
    for i, offsets in enumerate(offset_mapping):
      # truncate tokens, boxes and labels based on length of chunk - 2 (special tokens <s> and </s>)
      sequence_length = len(encodings.input_ids[i]) - 2
      if i == 0: start = 0
      else: start += sequence_length_prev - doc_stride
      end = start + sequence_length
      sequence_length_prev = sequence_length

      # get tokens, boxes and labels of this image chunk
      bb = [cls_box] + bboxes_list[start:end] + [sep_box]

      # as the last chunk can have a length < max_length
      # we must to add [tokenizer.pad_token] (tokens), [sep_box] (boxes) and [-100] (labels)
      if len(bb) < max_length:
        bb = bb + [sep_box] * (max_length - len(bb))

      # append results
      input_ids_list.append(encodings["input_ids"][i])
      attention_mask_list.append(encodings["attention_mask"][i])
      bb_list.append(bb)
      images_ids_list.append(image_id)
      chunks_ids_list.append(i)
      images_pixels_list.append(image_pixels)
        
  return {
      "images_ids": images_ids_list,
      "chunk_ids": chunks_ids_list,
      "input_ids": input_ids_list,
      "attention_mask": attention_mask_list,
      "normalized_bboxes": bb_list,
      "images_pixels": images_pixels_list
  }

In [ ]:

from torch.utils.data import Dataset

class CustomDataset(Dataset):
  def __init__(self, dataset, tokenizer):
    self.dataset = dataset
    self.tokenizer = tokenizer

  def __len__(self):
    return len(self.dataset)

  def __getitem__(self, idx):
    # get item
    example = self.dataset[idx]
    encoding = dict()
    encoding["images_ids"] = example["images_ids"]
    encoding["chunk_ids"] = example["chunk_ids"]
    encoding["input_ids"] = example["input_ids"]
    encoding["attention_mask"] = example["attention_mask"]
    encoding["bbox"] = example["normalized_bboxes"]
    encoding["images_pixels"] = example["images_pixels"]
    
    return encoding

In [ ]:

import torch.nn.functional as F

# get predictions at token level
def predictions_token_level(images, custom_encoded_dataset):

    num_imgs = len(images)
    if num_imgs > 0:

        chunk_ids, input_ids, bboxes, pixels_values, outputs, token_predictions  = dict(), dict(), dict(), dict(), dict(), dict()
        images_ids_list = list()

        for i,encoding in enumerate(custom_encoded_dataset):

            # get custom encoded data
            image_id = encoding['images_ids']
            chunk_id = encoding['chunk_ids']
            input_id = torch.tensor(encoding['input_ids'])[None]
            attention_mask = torch.tensor(encoding['attention_mask'])[None]
            bbox = torch.tensor(encoding['bbox'])[None]
            pixel_values = torch.tensor(encoding["images_pixels"])

            # save data in dictionnaries
            if image_id not in images_ids_list: images_ids_list.append(image_id)

            if image_id in chunk_ids: chunk_ids[image_id].append(chunk_id)
            else: chunk_ids[image_id] = [chunk_id]

            if image_id in input_ids: input_ids[image_id].append(input_id)
            else: input_ids[image_id] = [input_id]

            if image_id in bboxes: bboxes[image_id].append(bbox)
            else: bboxes[image_id] = [bbox]

            if image_id in pixels_values: pixels_values[image_id].append(pixel_values)
            else: pixels_values[image_id] = [pixel_values]

            # get prediction with forward pass
            with torch.no_grad():
                output = model(
                    input_ids=input_id.to(device),
                    attention_mask=attention_mask.to(device),
                    bbox=bbox.to(device),
                    image=pixel_values.to(device)
                    )

            # save probabilities of predictions in dictionnary
            if image_id in outputs: outputs[image_id].append(F.softmax(output.logits.squeeze(), dim=-1))
            else: outputs[image_id] = [F.softmax(output.logits.squeeze(), dim=-1)]

        return outputs, images_ids_list, chunk_ids, input_ids, bboxes

    else:
        print("An error occurred while getting predictions!")

In [ ]:

from functools import reduce

# Get predictions (line level)
def predictions_line_level(dataset, outputs, images_ids_list, chunk_ids, input_ids, bboxes):

    ten_probs_dict, ten_input_ids_dict, ten_bboxes_dict = dict(), dict(), dict()
    bboxes_list_dict, input_ids_dict_dict, probs_dict_dict, df = dict(), dict(), dict(), dict()

    if len(images_ids_list) > 0:

        for i, image_id in enumerate(images_ids_list):

            # get image information
            images_list = dataset.filter(lambda example: example["images_ids"] == image_id)["images"]
            image = images_list[0]
            width, height = image.size

            # get data
            chunk_ids_list = chunk_ids[image_id]
            outputs_list = outputs[image_id]
            input_ids_list = input_ids[image_id]
            bboxes_list = bboxes[image_id]

            # create zeros tensors
            ten_probs = torch.zeros((outputs_list[0].shape[0] - 2)*len(outputs_list), outputs_list[0].shape[1])
            ten_input_ids = torch.ones(size=(1, (outputs_list[0].shape[0] - 2)*len(outputs_list)), dtype =int)
            ten_bboxes = torch.zeros(size=(1, (outputs_list[0].shape[0] - 2)*len(outputs_list), 4), dtype =int)

            if len(outputs_list) > 1:
              
                for num_output, (output, input_id, bbox) in enumerate(zip(outputs_list, input_ids_list, bboxes_list)):
                    start = num_output*(max_length - 2) - max(0,num_output)*doc_stride
                    end = start + (max_length - 2)
                    
                    if num_output == 0:
                        ten_probs[start:end,:] += output[1:-1]
                        ten_input_ids[:,start:end] = input_id[:,1:-1]
                        ten_bboxes[:,start:end,:] = bbox[:,1:-1,:]
                    else:
                        ten_probs[start:start + doc_stride,:] += output[1:1 + doc_stride]
                        ten_probs[start:start + doc_stride,:] = ten_probs[start:start + doc_stride,:] * 0.5
                        ten_probs[start + doc_stride:end,:] += output[1 + doc_stride:-1]

                        ten_input_ids[:,start:start + doc_stride] = input_id[:,1:1 + doc_stride]
                        ten_input_ids[:,start + doc_stride:end] = input_id[:,1 + doc_stride:-1]

                        ten_bboxes[:,start:start + doc_stride,:] = bbox[:,1:1 + doc_stride,:]
                        ten_bboxes[:,start + doc_stride:end,:] = bbox[:,1 + doc_stride:-1,:]
              
            else:
                ten_probs += outputs_list[0][1:-1] 
                ten_input_ids = input_ids_list[0][:,1:-1] 
                ten_bboxes = bboxes_list[0][:,1:-1] 

            ten_probs_list, ten_input_ids_list, ten_bboxes_list = ten_probs.tolist(), ten_input_ids.tolist()[0], ten_bboxes.tolist()[0]
            bboxes_list = list()
            input_ids_dict, probs_dict = dict(), dict()
            bbox_prev = [-100, -100, -100, -100]
            for probs, input_id, bbox in zip(ten_probs_list, ten_input_ids_list, ten_bboxes_list):
                bbox = denormalize_box(bbox, width, height)
                if bbox != bbox_prev and bbox != cls_box and bbox != sep_box and bbox[0] != bbox[2] and bbox[1] != bbox[3]:
                    bboxes_list.append(bbox)
                    input_ids_dict[str(bbox)] = [input_id]
                    probs_dict[str(bbox)] = [probs]
                elif bbox != cls_box and bbox != sep_box and bbox[0] != bbox[2] and bbox[1] != bbox[3]:
                    input_ids_dict[str(bbox)].append(input_id)
                    probs_dict[str(bbox)].append(probs)
                bbox_prev = bbox

            probs_bbox = dict()
            for i,bbox in enumerate(bboxes_list):
                probs = probs_dict[str(bbox)]
                probs = np.array(probs).T.tolist()
            
                probs_label = list()
                for probs_list in probs:
                    prob_label = reduce(lambda x, y: x*y, probs_list)
                    prob_label  = prob_label**(1./(len(probs_list))) # normalization
                    probs_label.append(prob_label)
                max_value = max(probs_label)
                max_index = probs_label.index(max_value)
                probs_bbox[str(bbox)] = max_index

            bboxes_list_dict[image_id] = bboxes_list
            input_ids_dict_dict[image_id] = input_ids_dict
            probs_dict_dict[image_id] = probs_bbox

            df[image_id] = pd.DataFrame()
            df[image_id]["bboxes"] = bboxes_list
            df[image_id]["texts"] = [tokenizer.decode(input_ids_dict[str(bbox)]) for bbox in bboxes_list]
            df[image_id]["labels"] = [id2label[probs_bbox[str(bbox)]] for bbox in bboxes_list]

        return probs_bbox, bboxes_list_dict, input_ids_dict_dict, probs_dict_dict, df

    else:
        print("An error occurred while getting predictions!")

In [ ]:

# Get labeled images with lines bounding boxes
def get_labeled_images(dataset, images_ids_list, bboxes_list_dict, probs_dict_dict):

    labeled_images = list()

    for i, image_id in enumerate(images_ids_list):

        # get image
        images_list = dataset.filter(lambda example: example["images_ids"] == image_id)["images"]
        image = images_list[0]
        width, height = image.size

        # get predicted boxes and labels
        bboxes_list = bboxes_list_dict[image_id]
        probs_bbox = probs_dict_dict[image_id]

        draw = ImageDraw.Draw(image)
        # https://stackoverflow.com/questions/66274858/choosing-a-pil-imagefont-by-font-name-rather-than-filename-and-cross-platform-f
        font = font_manager.FontProperties(family='sans-serif', weight='bold')
        font_file = font_manager.findfont(font)
        font_size = 30
        font = ImageFont.truetype(font_file, font_size)

        for bbox in bboxes_list:
              predicted_label = id2label[probs_bbox[str(bbox)]]
              draw.rectangle(bbox, outline=label2color[predicted_label])
              draw.text((bbox[0] + 10, bbox[1] - font_size), text=predicted_label, fill=label2color[predicted_label], font=font)

        labeled_images.append(image)

    return labeled_images

In [ ]:

# get data of encoded chunk
def get_encoded_chunk_inference(index_chunk=None):

  # get datasets
  example = dataset
  encoded_example = encoded_dataset

  # get randomly a document in dataset
  if index_chunk == None: index_chunk = random.randint(0, len(encoded_example)-1)
  encoded_example = encoded_example[index_chunk]
  encoded_image_ids = encoded_example["images_ids"]

  # get the image
  example = example.filter(lambda example: example["images_ids"] ==  encoded_image_ids)[0]
  image = example["images"] # original image
  width, height = image.size
  page_no = example["page_no"]
  num_pages = example["num_pages"]

  # get boxes, texts, categories
  bboxes, input_ids  = encoded_example["normalized_bboxes"][1:-1], encoded_example["input_ids"][1:-1]
  bboxes = [denormalize_box(bbox, width, height) for bbox in bboxes]
  num_tokens = len(input_ids) + 2

  # get unique bboxes and corresponding labels
  bboxes_list, input_ids_list = list(), list()
  input_ids_dict = dict()
  bbox_prev = [-100, -100, -100, -100]
  for i, (bbox, input_id) in enumerate(zip(bboxes, input_ids)):
    if bbox != bbox_prev:
      bboxes_list.append(bbox)
      input_ids_dict[str(bbox)] = [input_id]
    else:
      input_ids_dict[str(bbox)].append(input_id)
    
    # start_indexes_list.append(i)
    bbox_prev = bbox
  
  # do not keep "</s><pad><pad>..."
  if input_ids_dict[str(bboxes_list[-1])][0] == (tokenizer.convert_tokens_to_ids('</s>')):
    del input_ids_dict[str(bboxes_list[-1])]
    bboxes_list = bboxes_list[:-1]

  # get texts by line
  input_ids_list = input_ids_dict.values()
  texts_list = [tokenizer.decode(input_ids) for input_ids in input_ids_list]
  
  # display DataFrame
  df = pd.DataFrame({"texts": texts_list, "input_ids": input_ids_list, "bboxes": bboxes_list})

  return image, df, num_tokens, page_no, num_pages

In [ ]:

# display chunk of PDF image and its data
def display_chunk_lines_inference(index_chunk=None):

  # get image and image data
  image, df, num_tokens, page_no, num_pages = get_encoded_chunk_inference(index_chunk=index_chunk)

  # get data from dataframe
  input_ids = df["input_ids"]
  texts = df["texts"]
  bboxes = df["bboxes"]

  print(f'Chunk ({num_tokens} tokens) of the PDF (page: {page_no+1} / {num_pages})\n')

  # display image with bounding boxes
  print(">> PDF image with bounding boxes of lines\n")
  draw = ImageDraw.Draw(image)
            
  labels = list()
  for box, text in zip(bboxes, texts):
      color = "red"
      draw.rectangle(box, outline=color)
  
  # resize image to original
  width, height = image.size
  image = image.resize((int(0.5*width), int(0.5*height)))

  # convert to cv and display
  img = np.array(image, dtype='uint8') # PIL to cv2
  cv2_imshow(img)
  cv2.waitKey(0)

  # display image dataframe
  print("\n>> Dataframe of annotated lines\n")
  cols = ["texts",	"bboxes"]
  df = df[cols]
  display(df)

APP function¶

In [ ]:

# APP outputs 
def app_outputs(uploaded_pdf):
    filename, msg, images = pdf_to_images(uploaded_pdf)
    num_images = len(images)

    if not msg.startswith("Error with the PDF"):
    
        # Extraction of image data (text and bounding boxes)
        dataset, lines, row_indexes, par_boxes, line_boxes = extraction_data_from_image(images)
        # prepare our data in the format of the model
        encoded_dataset = dataset.map(prepare_inference_features, batched=True, batch_size=64, remove_columns=dataset.column_names)
        custom_encoded_dataset = CustomDataset(encoded_dataset, tokenizer)
        # Get predictions (token level)
        outputs, images_ids_list, chunk_ids, input_ids, bboxes = predictions_token_level(images, custom_encoded_dataset)
        # Get predictions (line level)
        probs_bbox, bboxes_list_dict, input_ids_dict_dict, probs_dict_dict, df = predictions_line_level(dataset, outputs, images_ids_list, chunk_ids, input_ids, bboxes)
        # Get labeled images with lines bounding boxes
        images = get_labeled_images(dataset, images_ids_list, bboxes_list_dict, probs_dict_dict)

        img_files = list()
        # get image of PDF without bounding boxes
        for i in range(num_images):
            if filename != "files/blank.png": img_file = f"img_{i}_" + filename.replace(".pdf", ".png")
            else: img_file = filename.replace(".pdf", ".png")
            images[i].save(img_file)
            img_files.append(img_file)

        if num_images < max_imgboxes:
            img_files += [image_blank]*(max_imgboxes - num_images)
            images += [Image.open(image_blank)]*(max_imgboxes - num_images)
            for count in range(max_imgboxes - num_images):
                df[num_images + count] = pd.DataFrame()
        else:
            img_files = img_files[:max_imgboxes]
            images = images[:max_imgboxes]
            df = dict(itertools.islice(df.items(), max_imgboxes))

        # save 
        csv_files = list()
        for i in range(max_imgboxes):
            csv_file = f"csv_{i}_" + filename.replace(".pdf", ".csv")
            csv_files.append(gr.File.update(value=csv_file, visible=True))
            df[i].to_csv(csv_file, encoding="utf-8", index=False)

    else:  
        img_files, images, csv_files = [""]*max_imgboxes, [""]*max_imgboxes, [""]*max_imgboxes
        img_files[0], img_files[1] = image_blank, image_blank
        images[0], images[1] = Image.open(image_blank), Image.open(image_blank)
        csv_file = "csv_wo_content.csv"
        csv_files[0], csv_files[1] = gr.File.update(value=csv_file, visible=True), gr.File.update(value=csv_file, visible=True)
        df, df_empty = dict(), pd.DataFrame()
        df[0], df[1] = df_empty.to_csv(csv_file, encoding="utf-8", index=False), df_empty.to_csv(csv_file, encoding="utf-8", index=False)
    
    return msg, img_files[0], img_files[1], images[0], images[1], csv_files[0], csv_files[1], df[0], df[1]

Model & tokenizer¶

In [ ]:

from transformers import LayoutLMv2ForTokenClassification # LayoutXLMTokenizerFast, 

import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# model
# tokenizer = LayoutXLMTokenizerFast.from_pretrained(model_id)
model = LayoutLMv2ForTokenClassification.from_pretrained(model_id);
model.to(device);

# feature extractor
from transformers import LayoutLMv2FeatureExtractor
feature_extractor = LayoutLMv2FeatureExtractor(apply_ocr=False)

# tokenizer
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(tokenizer_id)

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.

Moving 0 files to the new cache system

0it [00:00, ?it/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/3.32k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/1.48G [00:00<?, ?B/s]

/usr/local/lib/python3.9/dist-packages/transformers/models/layoutlmv2/feature_extraction_layoutlmv2.py:30: FutureWarning: The class LayoutLMv2FeatureExtractor is deprecated and will be removed in version 5 of Transformers. Please use LayoutLMv2ImageProcessor instead.
  warnings.warn(

Downloading (…)lve/main/config.json:   0%|          | 0.00/615 [00:00<?, ?B/s]

Downloading (…)tencepiece.bpe.model:   0%|          | 0.00/5.07M [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/9.10M [00:00<?, ?B/s]

In [ ]:

# get labels
id2label = model.config.id2label
label2id = model.config.label2id
num_labels = len(id2label)

Inference¶

In [ ]:

#@title Get PDF and extract data by OCR
from google.colab import files
uploaded_pdf = files.upload()

Upload widget is only available when the cell has been executed in the current browser session. Please rerun this cell to enable.

Saving example.pdf to example.pdf

In [ ]:

#@title Converting PDF to Images
filename, images = pdf_to_images(uploaded_pdf)

The PDF "example.pdf"" was converted into 12 images.
Now, you can extract data from theses images (text, bounding boxes...).

In [ ]:

#@title Extraction of image data (text and bounding boxes)... this may take a while...
dataset, lines, row_indexes, par_boxes, line_boxes = extraction_data_from_image(images)

In [ ]:

#@title (checking) Display lines bounding boxes of the first page
# https://techtutorialsx.com/2020/12/29/python-opencv-draw-rectangles/

i = 0
image = images[i].copy() # PIL
width, height = image.size
img = np.array(image, dtype='uint8') # PIL to cv2

# paragraphs
for box in par_boxes[i]:
  cv2.rectangle(img, (box[0], box[1]), (box[2], box[3]), (255, 0, 0), 3) # blue
# lines
for box in line_boxes[i]:
  cv2.rectangle(img, (box[0], box[1]), (box[2], box[3]), (0, 0, 255), 2) # red

img = cv2.resize(img, (int(width/4), int(height/4)))
cv2_imshow(img)
cv2.waitKey(0)

df = pd.DataFrame({"texts": lines[i], "bboxes lines": line_boxes[i]})
display(df)

	texts	bboxes lines
0	BRIEFING	[205, 146, 408, 184]
1	The European Commission's	[374, 387, 1276, 444]
2	annual rule of law reports	[416, 471, 1235, 542]
3	A new monitoring tool	[539, 571, 1109, 613]
4	SUMMARY	[201, 688, 371, 713]
5	The annual rule of law reports, launched by th...	[200, 749, 1450, 781]
6	to the European institutions’ rule of law tool...	[201, 789, 1452, 809]
7	as it collects data on the state of the rule o...	[201, 828, 1452, 843]
8	drawing legal conclusions or giving specific r...	[201, 859, 1452, 880]
9	published in July 2021 and the third is expect...	[201, 894, 1452, 917]
10	permanent mechanism.	[201, 915, 502, 964]
11	The methodology adopted by the Commission prov...	[200, 985, 1452, 1006]
12	27 Member States: (i) justice systems; (ii) th...	[201, 1022, 1450, 1043]
13	{iv} other institutional issues related to che...	[201, 1059, 1452, 1080]
14	involvement of Member States in the preparatio...	[201, 1095, 1350, 1122]
15	The Member States are involved throughout the ...	[200, 1150, 1452, 1175]
16	on the rule of law that meets regularly with t...	[201, 1192, 1450, 1206]
17	contributions to the report; (iii) dialogue be...	[201, 1222, 1452, 1243]
18	network of contact persons, the group of conta...	[201, 1259, 1450, 1280]
19	contact points on corruption, and bilaterally ...	[201, 1300, 1452, 1316]
20	(v) the opportunity for each Member State to c...	[201, 1332, 1445, 1353]
21	The reports have met with some criticism from ...	[200, 1385, 1453, 1413]
22	descriptive, rather than prescriptive nature o...	[201, 1423, 1366, 1448]
23	IN THIS BRIEFING	[802, 1494, 1032, 1515]
24	Introduction	[851, 1547, 1001, 1570]
25	Methodology for preparing the annual rule of	[851, 1584, 1427, 1605]
26	law reports	[851, 1618, 985, 1645]
27	First annual rule of law report (September 2020)	[851, 1655, 1424, 1678]
28	Second annual rule of law report July 2021)	[851, 1689, 1379, 1713]
29	European Parliament position	[851, 1725, 1209, 1750]
30	Council position	[851, 1758, 1046, 1786]
31	Expert and stakeholder views	[851, 1794, 1204, 1815]
32	Conclusions	[851, 1828, 996, 1849]
33	ANNEX: Overview of the main findings of the	[851, 1865, 1426, 1884]
34	2021 report as regards judicial independence	[851, 1901, 1397, 1925]
35	EPRS \| European Parliamentary Research Service	[353, 2085, 1300, 2114]
36	EN	[1479, 2219, 1532, 2254]

Data encoding¶

Now, we need to prepare our data in the format of the model.

In [ ]:

encoded_dataset = dataset.map(prepare_inference_features, batched=True, batch_size=64, remove_columns=dataset.column_names)
encoded_dataset

Map:   0%|          | 0/12 [00:00<?, ? examples/s]

Out[ ]:

Dataset({
    features: ['images_ids', 'images_pixels', 'chunk_ids', 'input_ids', 'attention_mask', 'normalized_bboxes'],
    num_rows: 38
})

Checking of the encoded dataset¶

(Checking) Display a random annotated chunk image and its dataframe.

Note: the image is squared because of its normalization to 1000px vs 1000px in the encoded dataset (necessary for training the model).

In [ ]:

# get and image from random chunk
display_chunk_lines_inference()

Filter:   0%|          | 0/12 [00:00<?, ? examples/s]

Chunk (384 tokens) of the PDF (page: 6 / 12)

>> PDF image with bounding boxes of lines

>> Dataframe of annotated lines

	texts	bboxes
0	EPRS European Parliamentary Research Service	[200, 105, 750, 123]
1	followed by the Member State’s delegation pres...	[198, 187, 1453, 210]
2	their national cule of law framework, after wh...	[198, 224, 1452, 247]
3	practices.	[200, 261, 317, 287]
4	Expert and stakeholder views	[201, 322, 805, 355]
5	Following the publication of the first annual ...	[200, 390, 1450, 425]
6	convened a with Professor Petra Bard and Profe...	[200, 428, 1450, 449]
7	LIBE committee, argued that ‘simply publishing...	[200, 465, 1452, 493]
8	contain and address rule of law backsliding in...	[200, 502, 1450, 523]
9	annual reporting cycle will not, in and of its...	[198, 540, 1450, 561]
10	rule of law or deter legal hooligans, as the R...	[200, 575, 1450, 596]
11	making ne concrete recommendations.’ Furthermo...	[200, 612, 1450, 640]
12	euphemistic language (e.g. the term ‘reform’ t...	[200, 647, 1450, 678]
13	judicial independence); denying what he descri...	[196, 685, 1450, 708]
14	its focus on too short a period of time, which...	[200, 722, 1283, 750]
15	For Professor Pech, the authors of the report ...	[200, 776, 1450, 797]
16	under the Article 7 procedure (Poland, Hungary...	[200, 813, 1450, 835]
17	‘normalis[e] the abnormal’. In conclusion, he ...	[200, 849, 1452, 872]
18	Report has, for now, primarily resulted in giv...	[200, 886, 1450, 909]
19	of law problems everywhere...” And while we ar...	[198, 923, 1450, 947]

Create a custom dataset¶

We end our data preparation with a new class that keeps only the information needed for inference.

In [ ]:

from torch.utils.data import Dataset

class CustomDataset(Dataset):
  def __init__(self, dataset, tokenizer):
    self.dataset = dataset
    self.tokenizer = tokenizer

  def __len__(self):
    return len(self.dataset)

  def __getitem__(self, idx):
    # get item
    example = self.dataset[idx]
    encoding = dict()
    encoding["images_ids"] = example["images_ids"]
    encoding["chunk_ids"] = example["chunk_ids"]
    encoding["input_ids"] = example["input_ids"]
    encoding["attention_mask"] = example["attention_mask"]
    encoding["bbox"] = example["normalized_bboxes"]
    encoding["images_pixels"] = example["images_pixels"]
    
    return encoding
custom_encoded_dataset = CustomDataset(encoded_dataset, tokenizer)

Now, we can get the predictions!

Get predictions¶

LayoutXLM outputs labels at the token level, but we are interested in the predicted labels at the line level.

In [ ]:

#@title Get predictions (token level)
outputs, images_ids_list, chunk_ids, input_ids, bboxes = predictions_token_level(images, custom_encoded_dataset)

In [ ]:

#@title Get predictions (line level)
# probs_bbox, bboxes_list_dict, input_ids_dict_dict, probs_dict_dict, df = predictions_line_level(outputs, images_ids_list, chunk_ids, input_ids, bboxes)
probs_bbox, bboxes_list_dict, input_ids_dict_dict, probs_dict_dict, df = predictions_line_level(dataset, outputs, images_ids_list, chunk_ids, input_ids, bboxes)

Filter:   0%|          | 0/12 [00:00<?, ? examples/s]

Filter:   0%|          | 0/12 [00:00<?, ? examples/s]

Filter:   0%|          | 0/12 [00:00<?, ? examples/s]

Filter:   0%|          | 0/12 [00:00<?, ? examples/s]

Filter:   0%|          | 0/12 [00:00<?, ? examples/s]

Filter:   0%|          | 0/12 [00:00<?, ? examples/s]

Filter:   0%|          | 0/12 [00:00<?, ? examples/s]

Filter:   0%|          | 0/12 [00:00<?, ? examples/s]

Filter:   0%|          | 0/12 [00:00<?, ? examples/s]

Filter:   0%|          | 0/12 [00:00<?, ? examples/s]

Filter:   0%|          | 0/12 [00:00<?, ? examples/s]

Filter:   0%|          | 0/12 [00:00<?, ? examples/s]

In [ ]:

#@title Get labeled images with lines bounding boxes
labeled_images = get_labeled_images(dataset, images_ids_list, bboxes_list_dict, probs_dict_dict)

Filter:   0%|          | 0/12 [00:00<?, ? examples/s]

Filter:   0%|          | 0/12 [00:00<?, ? examples/s]

Filter:   0%|          | 0/12 [00:00<?, ? examples/s]

Filter:   0%|          | 0/12 [00:00<?, ? examples/s]

Filter:   0%|          | 0/12 [00:00<?, ? examples/s]

Filter:   0%|          | 0/12 [00:00<?, ? examples/s]

Filter:   0%|          | 0/12 [00:00<?, ? examples/s]

Filter:   0%|          | 0/12 [00:00<?, ? examples/s]

Filter:   0%|          | 0/12 [00:00<?, ? examples/s]

Filter:   0%|          | 0/12 [00:00<?, ? examples/s]

Filter:   0%|          | 0/12 [00:00<?, ? examples/s]

Filter:   0%|          | 0/12 [00:00<?, ? examples/s]

Display labeled images and get their dataframes¶

In [ ]:

print(f"Number of PDF page images: {len(labeled_images)}")

Number of PDF page images: 12

In [ ]:

#@title Labeled images
import matplotlib.pyplot as plt
%matplotlib inline

plt.figure(figsize=(15,10))
columns = 3
for i, image in enumerate(labeled_images):
  plt.subplot(int(len(images) / columns + 1), columns, i + 1)
  plt.xticks(color="white")
  plt.yticks(color="white")
  plt.tick_params(bottom = False)
  plt.tick_params(left = False)
  plt.imshow(image)

In [ ]:

#@title Display one labeled image (full size)
num_page = 0

print(f"Image of the labeled page {num_page} (at line level).")
labeled_images[num_page]

Image of the labeled page 0 (at line level).

Out[ ]:

In [ ]:

#@title DataFrame of one image
num_page = 0

print(f"DataFrame of the page {num_page} (at line level).")
df[num_page]

DataFrame of the page 0 (at line level).

Out[ ]:

	bboxes	texts	labels
0	[203, 145, 406, 182]	BRIEFING	Section-header
1	[373, 385, 1275, 442]	The European Commission's	Title
2	[415, 470, 1233, 540]	annual rule of law reports	Title
3	[537, 570, 1108, 612]	A new monitoring tool	Title
4	[200, 687, 370, 711]	SUMMARY	Section-header
5	[198, 748, 1448, 778]	The annual rule of law reports, launched by th...	Text
6	[200, 788, 1450, 806]	to the European institutions’ rule of law tool...	Text
7	[200, 825, 1450, 842]	as it collects data on the state of the rule o...	Text
8	[200, 858, 1450, 879]	drawing legal conclusions or giving specific r...	Text
9	[200, 893, 1450, 916]	published in July 2021 and the third is expect...	Text
10	[200, 914, 501, 963]	permanent mechanism.	Text
11	[198, 984, 1450, 1005]	The methodology adopted by the Commission prov...	Text
12	[200, 1019, 1448, 1040]	27 Member States: (i) justice systems; (ii) th...	Text
13	[200, 1057, 1450, 1078]	{iv} other institutional issues related to che...	Text
14	[200, 1094, 1349, 1120]	involvement of Member States in the preparatio...	Text
15	[198, 1148, 1450, 1174]	The Member States are involved throughout the ...	Text
16	[200, 1190, 1448, 1204]	on the rule of law that meets regularly with t...	Text
17	[200, 1220, 1450, 1242]	contributions to the report; (iii) dialogue be...	Text
18	[200, 1258, 1448, 1279]	network of contact persons, the group of conta...	Text
19	[200, 1298, 1450, 1314]	contact points on corruption, and bilaterally ...	Text
20	[200, 1330, 1443, 1351]	(v) the opportunity for each Member State to c...	Text
21	[198, 1384, 1452, 1412]	The reports have met with some criticism from ...	Text
22	[200, 1422, 1364, 1447]	descriptive, rather than prescriptive nature o...	Text
23	[800, 1492, 1030, 1513]	IN THIS BRIEFING	Section-header
24	[850, 1546, 1000, 1569]	Introduction	Section-header
25	[850, 1583, 1425, 1604]	Methodology for preparing the annual rule of	Table
26	[850, 1616, 984, 1644]	law reports	Table
27	[850, 1653, 1422, 1677]	First annual rule of law report (September 2020)	Table
28	[850, 1688, 1377, 1712]	Second annual rule of law report July 2021)	Table
29	[850, 1723, 1207, 1749]	European Parliament position	Table
30	[850, 1756, 1045, 1784]	Council position	Table
31	[850, 1791, 1202, 1812]	Expert and stakeholder views	Table
32	[850, 1826, 995, 1847]	Conclusions	Table
33	[850, 1864, 1425, 1882]	ANNEX: Overview of the main findings of the	Table
34	[850, 1899, 1395, 1924]	2021 report as regards judicial independence	Table
35	[352, 2084, 1298, 2112]	EPRS \| European Parliamentary Research Service	Page-footer
36	[1478, 2217, 1531, 2252]	EN	Page-footer

Document AI | Inference at line level with a Document Understanding model (LayoutXLM base fine-tuned on DocLayNet dataset)¶

Inference at line level¶

Install Detectron 2, PDF2image, LangDetect & PyTesseract OCR¶

Detectron 2¶

PDF2image¶

LangDetect¶

Tesseract OCR¶

Kill runtime¶

Set-up environment¶

Libraries¶

Key parameters¶

Functions¶

General¶

PDF processing¶

Inference¶

APP function¶

Model & tokenizer¶

Inference¶

Data encoding¶

Checking of the encoded dataset¶

Create a custom dataset¶

Get predictions¶

Display labeled images and get their dataframes¶

END¶