RolmOCR from Reducto is a powerful, open-source document OCR solution that delivers strong performance while requiring fewer resources than comparable models. Built on Qwen2.5-VL-7B, the model excels at parsing complex documents, including PDFs, invoices, and forms, without relying on document metadata. Because it is open source, companies can build proprietary pipelines on top of it without sending data to outside providers or model-hosting services.
Vast.ai offers a GPU marketplace where you can rent compute power at lower cost than the major cloud providers, with the flexibility to select hardware configurations suited to a given model, all while keeping your company's data private.
This notebook demonstrates how to extract structured pricing data from invoice images using reducto/RolmOCR.

reducto/RolmOCR on Vast

First, we will install and set up the Vast API. You can get your API key on the Account Page in the Vast Console and set it below in VAST_API_KEY.
%%bash
pip install vastai==0.2.6
%%bash
export VAST_API_KEY="" #Your key here
vastai set api-key $VAST_API_KEY
Next, we'll search for an instance to host our model. While reducto/RolmOCR requires at least 16GB of VRAM, we'll select an instance with 60GB of VRAM to accommodate larger documents and enable a wider context window.
%%bash
vastai search offers "compute_cap >= 750 \
geolocation=US \
gpu_ram >= 60 \
num_gpus = 1 \
static_ip = true \
direct_port_count >= 1 \
verified = true \
disk_space >= 80 \
rentable = true"
Finally, we will use the instance ID from our search to deploy our model to our instance.
Note: we set VLLM_USE_V1=1 to use the vLLM v1 engine, which reducto/RolmOCR requires.
%%bash
export INSTANCE_ID= #insert instance ID
vastai create instance $INSTANCE_ID --image vllm/vllm-openai:latest --env '-p 8000:8000 -e VLLM_USE_V1=1' --disk 80 --args --model reducto/RolmOCR
%%bash
pip install --upgrade openai datasets pydantic
We will use datasets to get a subset of the invoice data from the katanaml-org/invoices-donut-data-v1 dataset on Hugging Face, which contains 500 annotated invoice images with structured metadata for training document extraction models.
from datasets import load_dataset
# Stream the dataset
streamed_dataset = load_dataset("katanaml-org/invoices-donut-data-v1", split="train", streaming=True)
# Take the first 3 samples
subset = list(streamed_dataset.take(3))
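Each sample in this dataset pairs an 'image' with a 'ground_truth' JSON string. The sketch below illustrates that structure with values taken from the first invoice processed later in this notebook; the literal payload here is constructed for illustration, not read from the dataset:

```python
import json

# Illustrative ground_truth payload mirroring the dataset's nested structure
sample_gt = json.dumps({
    "gt_parse": {
        "header": {"invoice_no": "40378170"},
        "summary": {"total_gross_worth": "$8,25"},
    }
})

# The annotations of interest sit under gt_parse.header and gt_parse.summary
gt = json.loads(sample_gt)
print(gt["gt_parse"]["header"]["invoice_no"])          # 40378170
print(gt["gt_parse"]["summary"]["total_gross_worth"])  # $8,25
```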
We will then create an encode_pil_image function that converts images from our dataset into base64-encoded strings to pass to the OpenAI API.
import base64
import io
from PIL import Image

def encode_pil_image(pil_image):
    # Resize the image while maintaining aspect ratio
    max_size = 1024  # Maximum dimension in pixels
    ratio = min(max_size / pil_image.width, max_size / pil_image.height)
    new_size = (int(pil_image.width * ratio), int(pil_image.height * ratio))
    resized_image = pil_image.resize(new_size, Image.Resampling.LANCZOS)

    # Convert the PIL Image to JPEG bytes (reduced quality for a smaller payload)
    img_byte_arr = io.BytesIO()
    resized_image.save(img_byte_arr, format='JPEG', quality=85)
    img_byte_arr = img_byte_arr.getvalue()

    return base64.b64encode(img_byte_arr).decode("utf-8")
We'll define an Invoice schema with Pydantic to ensure our model returns precisely formatted data.
from pydantic import BaseModel

class Invoice(BaseModel):
    invoice_number: str
    invoice_amount: str

json_schema = Invoice.model_json_schema()
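For reference, model_json_schema() emits a standard JSON Schema dict, which is what vLLM's guided decoding consumes, and the same Pydantic model can validate the raw JSON the model returns. A self-contained sketch (repeating the model definition so it runs on its own):

```python
from pydantic import BaseModel, ValidationError

class Invoice(BaseModel):
    invoice_number: str
    invoice_amount: str

schema = Invoice.model_json_schema()
# Both fields are required strings in the emitted JSON Schema
assert schema["required"] == ["invoice_number", "invoice_amount"]
assert schema["properties"]["invoice_amount"]["type"] == "string"

# Pydantic can also validate raw model output directly
parsed = Invoice.model_validate_json(
    '{"invoice_number": "40378170", "invoice_amount": "$8.25"}'
)
print(parsed.invoice_amount)  # $8.25

# Malformed output raises ValidationError instead of passing through silently
try:
    Invoice.model_validate_json('{"invoice_number": "40378170"}')
except ValidationError:
    print("missing invoice_amount rejected")
```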
Next, we'll create our ocr_page_with_rolm function that calls the RolmOCR endpoint. We'll also set VAST_IP_ADDRESS and VAST_PORT for our running instance; both can be found in the Instances tab of the Vast Console.
from openai import OpenAI

VAST_IP_ADDRESS = ""
VAST_PORT = ""

base_url = f"http://{VAST_IP_ADDRESS}:{VAST_PORT}/v1"
client = OpenAI(api_key="", base_url=base_url)
model = "reducto/RolmOCR"

def ocr_page_with_rolm(img_base64):
    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        # encode_pil_image produces JPEG bytes, so declare image/jpeg
                        "image_url": {"url": f"data:image/jpeg;base64,{img_base64}"},
                    },
                    {
                        "type": "text",
                        "text": "Return the invoice number and total amount for each invoice as a json: {invoice_number : str, invoice_amount: str}",
                    },
                ],
            }
        ],
        extra_body={"guided_json": json_schema},
        temperature=0.2,
        max_tokens=500,
    )
    return response.choices[0].message.content
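Calls to a rented instance can fail transiently (cold starts, network blips). A small retry wrapper keeps the extraction loop robust; this is a generic sketch, not part of the Vast or vLLM APIs, demonstrated here with a stand-in flaky function:

```python
import time

def with_retries(fn, attempts=3, delay=1.0):
    """Call fn(), retrying on any exception with a fixed backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(delay)

# Usage with the OCR call above would look like:
#   result = with_retries(lambda: ocr_page_with_rolm(img_base64))

# Self-contained demonstration with a stub that fails twice, then succeeds
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

result = with_retries(flaky, attempts=3, delay=0.0)
print(result)  # ok
```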
Next, we will iterate over our subset to extract invoice_number and invoice_amount. We will display each original invoice and compare our JSON output to the ground-truth data.
import json

import matplotlib.pyplot as plt

invoices = []
ground_truth = []

for sample in subset:
    # Display the image
    plt.figure(figsize=(10, 14))
    plt.imshow(sample['image'])
    plt.axis('off')
    plt.show()

    # Process with OCR
    img_base64 = encode_pil_image(sample['image'])
    result = ocr_page_with_rolm(img_base64)
    result_dict = json.loads(result)
    invoices.append(result_dict)

    # Pull the matching fields out of the annotated ground truth
    ground_truth_i = json.loads(sample["ground_truth"])
    ground_truth_dict = {
        "invoice_number": ground_truth_i["gt_parse"]["header"]["invoice_no"],
        "invoice_amount": ground_truth_i["gt_parse"]["summary"]["total_gross_worth"],
    }
    ground_truth.append(ground_truth_dict)

    print("Ground Truth")
    print(ground_truth_dict)
    print("Extracted Info")
    print(result_dict)
Ground Truth
{'invoice_number': '40378170', 'invoice_amount': '$8,25'}
Extracted Info
{'invoice_number': '40378170', 'invoice_amount': '$8.25'}
Ground Truth
{'invoice_number': '61356291', 'invoice_amount': '$ 212,09'}
Extracted Info
{'invoice_number': '61356291', 'invoice_amount': '$212.09'}
Ground Truth
{'invoice_number': '49565075', 'invoice_amount': '$96,73'}
Extracted Info
{'invoice_number': '49565075', 'invoice_amount': '$96,73'}
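The ground-truth annotations use comma decimal separators and occasional stray spaces, while the model emits conventional formatting. A small normalization helper (an illustration, not part of the dataset tooling) makes the comparison explicit:

```python
def normalize_amount(amount):
    """Strip spaces and treat ',' and '.' as the same decimal separator."""
    return amount.replace(" ", "").replace(",", ".")

# Ground-truth / extracted pairs taken from the output above
pairs = [
    ("$8,25", "$8.25"),
    ("$ 212,09", "$212.09"),
    ("$96,73", "$96,73"),
]
matches = [normalize_amount(gt) == normalize_amount(pred) for gt, pred in pairs]
print(matches)  # [True, True, True]
```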
We see that the output from reducto/RolmOCR matches the ground truth for each invoice; the only differences are cosmetic, such as the comma decimal separators and stray spaces in the dataset's annotations.
In this notebook, we've demonstrated how to deploy and use RolmOCR on Vast to extract structured data from invoice images.
In our samples, RolmOCR accurately identified key information like invoice numbers and amounts. The combination of RolmOCR's efficiency with Vast.ai's cost-effective GPU options makes this an excellent solution for document processing workflows at scale.
This approach can be extended to extract other types of structured data from various document formats, enabling powerful automation capabilities for businesses of all sizes.