RolmOCR from Reducto is a powerful, open-source document OCR solution that delivers strong performance while requiring fewer resources than comparable models. Built on Qwen2.5-VL-7B, the model excels at parsing complex documents, including PDFs, invoices, and forms, without relying on document metadata. Because it is open source, companies can build proprietary pipelines on top of it without sending data to outside providers or model-hosting services.
Vast.ai offers a GPU marketplace where you can rent compute power at lower cost than the major cloud providers, with the flexibility to select hardware configurations suited to a given model, all while keeping your company's data private.
This notebook demonstrates how to extract structured pricing data from invoice images using reducto/RolmOCR.

reducto/RolmOCR on Vast

First, we will install and set up the Vast API. You can get your API key on the Account Page in the Vast Console and set it below in VAST_API_KEY.
%%bash
pip install vastai==0.2.6
%%bash
export VAST_API_KEY="" #Your key here
vastai set api-key $VAST_API_KEY
Next, we'll search for an instance to host our model. While reducto/RolmOCR requires at least 16GB of VRAM, we'll select an instance with 60GB of VRAM to accommodate larger documents and enable a wider context window.
%%bash
vastai search offers "compute_cap >= 750 \
geolocation=US \
gpu_ram >= 60 \
num_gpus = 1 \
static_ip = true \
direct_port_count >= 1 \
verified = true \
disk_space >= 80 \
rentable = true"
Finally, we will use the instance ID from our search to deploy our model to our instance.
Note: we set VLLM_USE_V1=1 to use the vLLM v1 engine, which reducto/RolmOCR requires.
%%bash
export INSTANCE_ID= #insert instance ID
vastai create instance $INSTANCE_ID --image vllm/vllm-openai:latest --env '-p 8000:8000 -e VLLM_USE_V1=1' --disk 80 --args --model reducto/RolmOCR
%%bash
pip install --upgrade openai datasets pydantic
We will use datasets to get a subset of the invoice data from the katanaml-org/invoices-donut-data-v1 dataset on Hugging Face, which contains 500 annotated invoice images with structured metadata for training document extraction models.
from datasets import load_dataset
# Stream the dataset
streamed_dataset = load_dataset("katanaml-org/invoices-donut-data-v1", split="train", streaming=True)
# Take the first 3 samples
subset = list(streamed_dataset.take(3))
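Each sample in this dataset pairs an 'image' with a 'ground_truth' JSON string. The sketch below illustrates that structure with values taken from the first invoice processed later in this notebook; the literal payload here is constructed for illustration, not read from the dataset:

```python
import json

# Illustrative ground_truth payload mirroring the dataset's nested structure
sample_gt = json.dumps({
    "gt_parse": {
        "header": {"invoice_no": "40378170"},
        "summary": {"total_gross_worth": "$8,25"},
    }
})

# The annotations of interest sit under gt_parse.header and gt_parse.summary
gt = json.loads(sample_gt)
print(gt["gt_parse"]["header"]["invoice_no"])          # 40378170
print(gt["gt_parse"]["summary"]["total_gross_worth"])  # $8,25
```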
We will then create an encode_pil_image function that converts images from our dataset into base64-encoded strings to pass to the OpenAI API.
import base64
import io
from PIL import Image

def encode_pil_image(pil_image):
    # Resize the image while maintaining aspect ratio
    max_size = 1024  # Maximum dimension in pixels
    ratio = min(max_size / pil_image.width, max_size / pil_image.height)
    new_size = (int(pil_image.width * ratio), int(pil_image.height * ratio))
    resized_image = pil_image.resize(new_size, Image.Resampling.LANCZOS)

    # Convert the PIL Image to JPEG bytes (reduced quality for a smaller payload)
    img_byte_arr = io.BytesIO()
    resized_image.save(img_byte_arr, format='JPEG', quality=85)
    img_byte_arr = img_byte_arr.getvalue()

    return base64.b64encode(img_byte_arr).decode("utf-8")
We'll define an Invoice schema with Pydantic to ensure our model returns precisely formatted data.
from pydantic import BaseModel

class Invoice(BaseModel):
    invoice_number: str
    invoice_amount: str

json_schema = Invoice.model_json_schema()
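For reference, model_json_schema() emits a standard JSON Schema dict, which is what vLLM's guided decoding consumes, and the same Pydantic model can validate the raw JSON the model returns. A self-contained sketch (repeating the model definition so it runs on its own):

```python
from pydantic import BaseModel, ValidationError

class Invoice(BaseModel):
    invoice_number: str
    invoice_amount: str

schema = Invoice.model_json_schema()
# Both fields are required strings in the emitted JSON Schema
assert schema["required"] == ["invoice_number", "invoice_amount"]
assert schema["properties"]["invoice_amount"]["type"] == "string"

# Pydantic can also validate raw model output directly
parsed = Invoice.model_validate_json(
    '{"invoice_number": "40378170", "invoice_amount": "$8.25"}'
)
print(parsed.invoice_amount)  # $8.25

# Malformed output raises ValidationError instead of passing through silently
try:
    Invoice.model_validate_json('{"invoice_number": "40378170"}')
except ValidationError:
    print("missing invoice_amount rejected")
```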
Next, we'll create our ocr_page_with_rolm function that calls the RolmOCR endpoint. We'll also set VAST_IP_ADDRESS and VAST_PORT for our running instance; both can be found in the Instances tab of the Vast Console.
from openai import OpenAI

VAST_IP_ADDRESS = ""
VAST_PORT = ""

base_url = f"http://{VAST_IP_ADDRESS}:{VAST_PORT}/v1"
client = OpenAI(api_key="", base_url=base_url)
model = "reducto/RolmOCR"

def ocr_page_with_rolm(img_base64):
    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        # encode_pil_image produces JPEG bytes, so declare image/jpeg
                        "image_url": {"url": f"data:image/jpeg;base64,{img_base64}"},
                    },
                    {
                        "type": "text",
                        "text": "Return the invoice number and total amount for each invoice as a json: {invoice_number : str, invoice_amount: str}",
                    },
                ],
            }
        ],
        extra_body={"guided_json": json_schema},
        temperature=0.2,
        max_tokens=500,
    )
    return response.choices[0].message.content
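Calls to a rented instance can fail transiently (cold starts, network blips). A small retry wrapper keeps the extraction loop robust; this is a generic sketch, not part of the Vast or vLLM APIs, demonstrated here with a stand-in flaky function:

```python
import time

def with_retries(fn, attempts=3, delay=1.0):
    """Call fn(), retrying on any exception with a fixed backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(delay)

# Usage with the OCR call above would look like:
#   result = with_retries(lambda: ocr_page_with_rolm(img_base64))

# Self-contained demonstration with a stub that fails twice, then succeeds
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

result = with_retries(flaky, attempts=3, delay=0.0)
print(result)  # ok
```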
Next, we will iterate over our subset to extract invoice_number and invoice_amount. We will display each original invoice and compare our JSON output to the ground-truth data.
import json

import matplotlib.pyplot as plt

invoices = []
ground_truth = []

for sample in subset:
    # Display the image
    plt.figure(figsize=(10, 14))
    plt.imshow(sample['image'])
    plt.axis('off')
    plt.show()

    # Process with OCR
    img_base64 = encode_pil_image(sample['image'])
    result = ocr_page_with_rolm(img_base64)
    result_dict = json.loads(result)
    invoices.append(result_dict)

    # Pull the matching fields out of the annotated ground truth
    ground_truth_i = json.loads(sample["ground_truth"])
    ground_truth_dict = {
        "invoice_number": ground_truth_i["gt_parse"]["header"]["invoice_no"],
        "invoice_amount": ground_truth_i["gt_parse"]["summary"]["total_gross_worth"],
    }
    ground_truth.append(ground_truth_dict)

    print("Ground Truth")
    print(ground_truth_dict)
    print("Extracted Info")
    print(result_dict)
Ground Truth
{'invoice_number': '40378170', 'invoice_amount': '$8,25'}
Extracted Info
{'invoice_number': '40378170', 'invoice_amount': '$8.25'}
Ground Truth
{'invoice_number': '61356291', 'invoice_amount': '$ 212,09'}
Extracted Info
{'invoice_number': '61356291', 'invoice_amount': '$212.09'}
Ground Truth
{'invoice_number': '49565075', 'invoice_amount': '$96,73'}
Extracted Info
{'invoice_number': '49565075', 'invoice_amount': '$96,73'}
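The ground-truth annotations use comma decimal separators and occasional stray spaces, while the model emits conventional formatting. A small normalization helper (an illustration, not part of the dataset tooling) makes the comparison explicit:

```python
def normalize_amount(amount):
    """Strip spaces and treat ',' and '.' as the same decimal separator."""
    return amount.replace(" ", "").replace(",", ".")

# Ground-truth / extracted pairs taken from the output above
pairs = [
    ("$8,25", "$8.25"),
    ("$ 212,09", "$212.09"),
    ("$96,73", "$96,73"),
]
matches = [normalize_amount(gt) == normalize_amount(pred) for gt, pred in pairs]
print(matches)  # [True, True, True]
```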
We see that the output from reducto/RolmOCR matches the ground truth for each invoice; the only differences are cosmetic, such as the comma decimal separators and stray spaces in the dataset's annotations.
In this notebook, we've demonstrated how to deploy and use RolmOCR on Vast to extract structured data from invoice images.
In our samples, RolmOCR accurately identified key information like invoice numbers and amounts. The combination of RolmOCR's efficiency with Vast.ai's cost-effective GPU options makes this an excellent solution for document processing workflows at scale.
This approach can be extended to extract other types of structured data from various document formats, enabling powerful automation capabilities for businesses of all sizes.