PDF to Structured JSON, 3× Faster Than Gemini

Datalab released lift — a self-hosted 9B model that turns any PDF into schema-valid JSON at 90.2% accuracy, 3× faster than Gemini Flash 3.5.

Published 2026-06-23 · 4 min read

Evgenii Arsentev · PhD


Datalab released lift, a 9-billion-parameter open-weights model built for one task: take a PDF, take a list of fields you want extracted, and return a clean JSON object. On a 225-document benchmark covering everything from 6-page invoices to 64-page reports — roughly 11,000 scored fields — it hit 90.2% field accuracy, beating NuExtract3 at 81.5% and Qwen3.5-9B at 76.3%.

Two design choices make lift practical for real use. First, output is schema-constrained: the model is not just prompted to return JSON, it is mechanically forced to. The structure always matches what you defined. Second, lift uses trained abstention — when a field is not present in the document, it returns null instead of guessing. That second point is the one that matters most in production. An AI model that confidently returns a wrong invoice total is worse than a blank cell: you might not notice the error until money has already moved.
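The two guarantees above can be checked on the consuming side. The following is a minimal sketch, not lift's own API: the field list, the `split_extracted` helper, and the sample response are all hypothetical, and it assumes only that the model's answer arrives as a JSON string in which an absent field is null.

```python
import json

# Hypothetical field list; the names are illustrative only.
FIELDS = ["invoice_number", "vendor_name", "total_amount"]


def split_extracted(raw_json: str, fields: list[str]) -> tuple[dict, list[str]]:
    """Parse a response, confirm it has exactly the requested keys,
    and separate filled fields from abstained (null) ones."""
    data = json.loads(raw_json)
    if set(data) != set(fields):
        # Symmetric difference lists both missing and unexpected keys.
        raise ValueError(f"key mismatch: {sorted(set(data) ^ set(fields))}")
    filled = {k: v for k, v in data.items() if v is not None}
    abstained = [k for k in fields if data[k] is None]
    return filled, abstained


# Made-up response in which the model abstained on the total.
response = '{"invoice_number": "INV-001", "vendor_name": "Acme", "total_amount": null}'
filled, abstained = split_extracted(response, FIELDS)
print("filled:", filled)
print("needs review:", abstained)
```

Keeping abstained fields in a separate list makes it easy to send them to a human instead of writing a blank or a guess into the downstream system.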

What this means if you work with documents

Think about any workflow that starts with a PDF: invoices waiting to be logged in accounting software, insurance forms to be filed, research papers to be indexed by date and author, bank statements to be parsed for transactions. These are almost always done by hand or with brittle scripts that break the moment the document template changes. lift is designed to replace that layer. You describe the fields you need — invoice_number, vendor_name, total_amount — and the model returns exactly those from whatever PDF you give it, with a median of 9.5 seconds per document.
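As an illustration of "describing the fields you need", here is one way such a description could be written down. This sketch uses JSON Schema conventions, which is an assumption: the article does not say what schema format lift accepts, and the `build_schema` helper and the type choices are hypothetical.

```python
import json


def build_schema(fields: dict[str, str]) -> dict:
    """Build a JSON-Schema-style object in which every field is required
    but may be null, so a missing value is represented explicitly."""
    return {
        "type": "object",
        "properties": {
            name: {"type": [json_type, "null"]}
            for name, json_type in fields.items()
        },
        "required": list(fields),
        "additionalProperties": False,
    }


# The three field names come from the article; the types are assumed.
schema = build_schema(
    {
        "invoice_number": "string",
        "vendor_name": "string",
        "total_amount": "number",
    }
)
print(json.dumps(schema, indent=2))
```

Allowing null on every field mirrors the abstention behaviour: the structure stays fixed even when a value is not in the document.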

Compared to cloud APIs like Gemini Flash 3.5, lift is three times faster on the same extraction task. It runs on your own hardware: the code ships under Apache 2.0 and the model weights under a Modified OpenRAIL-M license. No per-document fees, and no data leaves your server — which matters when the PDFs contain confidential financial or customer information.

What I'd actually do

If you have any process where someone manually copies data from PDFs into a spreadsheet or form, this is worth testing. Define your schema: just a JSON object listing the field names you need. Run lift on a batch of real documents and check the error rate before connecting it to anything that touches money or decisions. The 90.2% field accuracy is strong, but it still means roughly one field in ten was wrong on the benchmark, so for payment-critical fields, add a spot-check layer on top.
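Checking the error rate on your own batch could look like the sketch below. It assumes you have hand-labeled the expected values for a sample of documents and that exact equality is an acceptable match rule; the benchmark's actual scoring method is not described in the article, and the sample data here is made up.

```python
from collections import Counter


def field_error_rates(predictions: list[dict], ground_truth: list[dict]) -> dict:
    """Return the share of documents on which each field was wrong."""
    errors, totals = Counter(), Counter()
    for pred, truth in zip(predictions, ground_truth):
        for field, expected in truth.items():
            totals[field] += 1
            # Exact match is a simplification; a null where a value was
            # expected counts as an error too.
            if pred.get(field) != expected:
                errors[field] += 1
    return {field: errors[field] / totals[field] for field in totals}


# Made-up extraction results and hand-labeled truth for two documents.
predictions = [
    {"invoice_number": "INV-001", "total_amount": 120.0},
    {"invoice_number": "INV-002", "total_amount": None},
]
ground_truth = [
    {"invoice_number": "INV-001", "total_amount": 120.0},
    {"invoice_number": "INV-002", "total_amount": 75.5},
]

rates = field_error_rates(predictions, ground_truth)
for field, rate in rates.items():
    print(f"{field}: {rate:.0%} wrong")
```

Looking at rates per field rather than one overall number shows whether the errors cluster in the fields that matter most, such as totals.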


Author

Evgenii Arsentev

PhD · Chief Product Officer at a tech company

Source: marktechpost.com