Mistral OCR 4: 93.07 on OmniDocBench, 170 Languages, Self-Hosted

Mistral released OCR 4, with leading scores on the OmniDocBench and OlmOCRBench document benchmarks. It handles 170 languages, reports a confidence score per word, and can run on your own server.

4 min read · Evgenii Arsentev, PhD

Mistral released OCR 4 on June 23 — a document-understanding model that scores 93.07 on OmniDocBench, the main industry benchmark for extraction quality. On OlmOCRBench it hit 85.20, the highest recorded score in its class. When Mistral brought in independent annotators to compare it head-to-head against competitors, they preferred OCR 4 in 72% of matchups.

These aren't just leaderboard numbers. Document extraction is one of the messiest and most underdiscussed bottlenecks in building AI products. PDFs with mixed layouts, tables spanning multiple columns, signatures, equations, and handwritten annotations routinely defeat general-purpose tools — and when the extraction breaks, every downstream step breaks with it.

What OCR 4 does differently

The model doesn't just transcribe text — it reads document structure. It classifies each block of content: whether it's a title, a paragraph, a table, an equation, or a signature. It marks the exact position of every text element on the page. And it produces a confidence score for each word and each page — a number indicating how certain the model is about what it read.
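To make that output concrete, here is a minimal Python sketch of how such a result could be modeled on the consuming side. The class names, field names, the (x0, y0, x1, y1) bounding-box layout, and the 0 to 1 confidence scale are all assumptions for illustration, not Mistral's actual response schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical schema for illustration; the real OCR 4 output format may differ.
BLOCK_KINDS = {"title", "paragraph", "table", "equation", "signature"}


@dataclass
class Word:
    text: str
    bbox: Tuple[float, float, float, float]  # assumed (x0, y0, x1, y1) page coordinates
    confidence: float  # assumed to lie in [0.0, 1.0]


@dataclass
class Block:
    kind: str  # one of BLOCK_KINDS
    words: List[Word] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.kind not in BLOCK_KINDS:
            raise ValueError(f"unknown block kind: {self.kind!r}")

    @property
    def text(self) -> str:
        # Join the words in reading order to recover the block's text.
        return " ".join(w.text for w in self.words)


@dataclass
class Page:
    number: int
    confidence: float  # page-level score, assumed to lie in [0.0, 1.0]
    blocks: List[Block] = field(default_factory=list)

    def blocks_of(self, kind: str) -> List[Block]:
        # Return all blocks of one kind, e.g. every table on the page.
        return [b for b in self.blocks if b.kind == kind]


page = Page(
    number=1,
    confidence=0.97,
    blocks=[
        Block("title", [
            Word("Service", (72.0, 40.0, 130.0, 60.0), 0.99),
            Word("Agreement", (135.0, 40.0, 220.0, 60.0), 0.98),
        ]),
        Block("signature", [Word("J.Doe", (72.0, 700.0, 160.0, 730.0), 0.71)]),
    ],
)

titles = [b.text for b in page.blocks_of("title")]
```

Keeping the block kind, position, and confidence together on each element means downstream code can ask questions like "which signature blocks were read with low confidence?" without re-parsing the page.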

That confidence scoring changes the practical math on automation. If you're building a contract reviewer or an invoice processor, you currently face a hard choice: accept some error rate, or manually review everything. Confidence scores give you a third option — automatically accept high-confidence outputs and route only the flagged sections to a human. That's how you actually automate document processing at scale without cutting corners on accuracy.
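The accept-or-review split can be sketched in a few lines of Python. This is an illustration under stated assumptions: the input is a list of hypothetical (text, confidence) pairs on a 0 to 1 scale, and the 0.90 threshold is a placeholder you would tune against your own documents, not a value recommended by Mistral.

```python
from typing import Dict, Iterable, List, Tuple


def route_by_confidence(
    words: Iterable[Tuple[str, float]],
    threshold: float = 0.90,  # placeholder value; tune on your own data
) -> Dict[str, List[str]]:
    """Split extracted words into auto-accepted and human-review buckets."""
    accepted: List[str] = []
    review: List[str] = []
    for text, confidence in words:
        if confidence >= threshold:
            accepted.append(text)
        else:
            review.append(text)
    return {"accepted": accepted, "review": review}


# Hypothetical extraction result for one invoice line.
extracted = [("Invoice", 0.99), ("No.", 0.97), ("10O45", 0.62), ("Total", 0.98)]
routed = route_by_confidence(extracted)
```

Here only the doubtful token "10O45" lands in the review bucket, so a human checks one word instead of the whole line.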

OCR 4 handles 170 languages across ten language families, with particularly strong performance on languages and scripts where many OCR models struggle, such as Arabic, Indic scripts, and East Asian languages.

Pricing and deployment

API access costs $4 per 1,000 pages through Mistral Studio. The Batch API, for large-volume jobs where immediate results aren't required, is $2 per 1,000 pages — half the rate. OCR 4 is also available on Amazon SageMaker and Microsoft Foundry, with Snowflake support coming.
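For budgeting, the two API rates reduce to simple arithmetic. The sketch below assumes pricing is strictly linear in page count, with no minimums, volume discounts, or rounding rules, which the article does not specify. The function name is hypothetical.

```python
from decimal import Decimal

# Rates quoted in the article, in US dollars per 1,000 pages.
STANDARD_PER_1000 = Decimal("4")
BATCH_PER_1000 = Decimal("2")


def api_cost(pages: int, batch: bool = False) -> Decimal:
    """Estimate API cost in dollars, assuming strictly linear per-page pricing."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    rate = BATCH_PER_1000 if batch else STANDARD_PER_1000
    return rate * pages / 1000


standard = api_cost(250_000)           # standard API
batched = api_cost(250_000, batch=True)  # Batch API
```

Under these assumptions, 250,000 pages come to $1,000 on the standard API and $500 through the Batch API.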

The option worth paying attention to is self-hosted deployment: OCR 4 ships as a single container you run on your own server. That means no cloud subscription, no external processing of confidential files, and no per-page costs after the initial setup. For use cases involving contracts, financial documents, medical records, or anything with legal sensitivity, local deployment is often not optional — it's the only deployment that compliance allows.

What I'd actually do

If you're building something that reads PDFs — contracts, forms, invoices — grab access on Mistral Studio and run your nastiest test documents through OCR 4. Focus on the confidence scores: they'll tell you which parts of your extraction are trustworthy and which need a human check. For anything involving sensitive data that can't leave your infrastructure, the self-hosted container is worth evaluating seriously.

#Mistral #OCR #documents #self-hosted

Author: Evgenii Arsentev, PhD · Chief Product Officer at a tech company

Source: mistral.ai