A 3B Open Model That Rivals Giants at Math

VibeThinker-3B, an MIT-licensed 3-billion-parameter model, matches 671B and 1T rivals on math benchmarks and fits on a single GPU.

Published 2026-06-20 · 4 min read

Evgenii Arsentev · PhD

A new open-source model with just 3 billion parameters is matching reasoning systems hundreds of times larger on hard math. VibeThinker-3B, built by Sina Weibo and released under the permissive MIT license, scores 94.3 on the AIME26 math benchmark, edging out DeepSeek V3.2 (671 billion parameters) at 94.2 and Kimi K2.5 (1 trillion) at 93.3. On a held-out set of unseen LeetCode contests, it passed 123 of 128 problems (about 96%) on the first attempt.

The size gap is the headline. VibeThinker isn't trained from scratch; it's a post-trained specialist built on Qwen2.5-Coder-3B. Its BF16 weights come to about 6 GB (3 billion parameters at 2 bytes each), small enough to run on a single consumer GPU. A model you can fit on one card is competing, on a specific kind of problem, with models that need a server rack.
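To make the size gap concrete, here is a back-of-the-envelope sketch of the arithmetic: weight memory is roughly parameter count times bytes per parameter. The function names are mine, and the estimate covers weights only; activations and the KV cache need extra memory on top, so treat it as a lower bound.

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# Weights only; activations and KV cache are deliberately ignored.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "int8": 1}


def weight_gb(n_params: float, dtype: str = "bf16") -> float:
    """Approximate weight size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


def size_ratio(big_params: float, small_params: float) -> float:
    """How many times larger one model is than another, by parameter count."""
    return big_params / small_params
```

Calling `weight_gb(3e9)` reproduces the roughly 6 GB figure for a 3B model in BF16, and `size_ratio(671e9, 3e9)` shows the 671B model is a little over 220 times larger by parameter count.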

How a small model punches up

The trick is the training recipe, which the team calls Spectrum-to-Signal. In plain terms: first teach the model a wide spread of possible solution paths instead of one rote answer (the "spectrum"), then use reinforcement learning to sharpen the ones that actually work (the "signal"), focusing effort on problems right at the edge of what the model can currently do. There's also an optional test-time step that generates several attempts, checks its own intermediate claims, and votes on the most reliable answer — pushing AIME26 as high as 97.1 with no extra parameters.
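The voting part of that test-time step can be illustrated with a minimal sketch. This is plain majority voting over several sampled answers, not the team's actual procedure, which also checks intermediate claims. `sample_answer` is a hypothetical callable standing in for one call to whatever model you run.

```python
from collections import Counter
from typing import Callable, Optional


def majority_vote(answers: list[str]) -> Optional[str]:
    """Return the most common answer after trimming whitespace.

    Empty answers are dropped; ties go to the answer seen first.
    """
    cleaned = [a.strip() for a in answers if a and a.strip()]
    if not cleaned:
        return None
    return Counter(cleaned).most_common(1)[0][0]


def solve_with_voting(
    problem: str,
    sample_answer: Callable[[str], str],  # hypothetical: one model call per sample
    n_samples: int = 8,
) -> Optional[str]:
    """Sample several independent answers and keep the one most attempts agree on."""
    answers = [sample_answer(problem) for _ in range(n_samples)]
    return majority_vote(answers)
```

The design trade-off is simple: each extra sample costs one more generation, so accuracy is bought with inference time rather than with more parameters.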

It's worth being precise about where it shines and where it doesn't. On math and competitive coding, VibeThinker-3B trades blows with the giants. On broad knowledge tests like GPQA-Diamond it scores 70.2, well behind DeepSeek's 82.4 and Kimi's 87.6 — because raw factual recall really does scale with size, and 3 billion parameters can only hold so much. This is a sharp tool, not a Swiss Army knife.

Why it matters for you

The practical takeaway is about cost and control. If a 3B model can handle your narrow task — math, code, structured reasoning — you can run it locally, privately, and for roughly the price of electricity, instead of paying per token to a frontier API. For anyone building a product on top of AI, that's the difference between a feature that's free to run and one that bleeds money at scale.
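If you want to check that claim for your own workload, a rough break-even calculation is enough. The sketch below is my own arithmetic, not anything from the model's release; every input (API price, hardware cost, throughput, power draw, electricity price) is a number you have to supply yourself.

```python
from typing import Optional


def api_cost(tokens: float, usd_per_million_tokens: float) -> float:
    """Cost of processing `tokens` through a per-token API."""
    return tokens / 1e6 * usd_per_million_tokens


def local_cost_per_million(
    tokens_per_second: float, gpu_watts: float, usd_per_kwh: float
) -> float:
    """Electricity cost of generating one million tokens on your own GPU."""
    hours_per_million = 1e6 / tokens_per_second / 3600
    return hours_per_million * gpu_watts / 1000 * usd_per_kwh


def break_even_tokens(
    hardware_usd: float,
    usd_per_million_tokens: float,
    tokens_per_second: float,
    gpu_watts: float,
    usd_per_kwh: float,
) -> Optional[float]:
    """Tokens after which local hardware has paid for itself.

    Returns None when local electricity per token costs as much as,
    or more than, the API, in which case there is no break-even point.
    """
    local = local_cost_per_million(tokens_per_second, gpu_watts, usd_per_kwh)
    saving_per_million = usd_per_million_tokens - local
    if saving_per_million <= 0:
        return None
    return hardware_usd / saving_per_million * 1e6
```

This ignores your own time, maintenance, and idle hardware, so read the result as an optimistic estimate.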

It also chips away at the assumption that bigger is always better. The story of 2026 isn't just trillion-parameter labs racing each other; it's small, specialized models getting good enough that the frontier stops being the only option. MIT licensing means anyone can take VibeThinker, inspect it, fine-tune it, and ship it — no permission, no usage cap, no kill switch.

My read: don't reach for the biggest model out of habit. For a tightly scoped job, a tuned small model is often faster, cheaper, and entirely yours — and 'entirely yours' is worth a lot when an API you don't control can change its price or pull a model overnight.

What I'd actually do

Before defaulting to a frontier API, ask whether your task is actually narrow — math, code, classification. If it is, test a small open model like this one on your real examples. If it clears the bar, you've just swapped a recurring bill for a one-time setup that runs on your own hardware.
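That test can be as simple as the sketch below: run your own prompt and expected-answer pairs through the model and compare the pass rate to a bar you choose. `model_fn` is a hypothetical callable wrapping whichever model you are trying, the 0.9 default is an arbitrary placeholder, and exact string matching is an assumption that only suits tasks with a single correct answer.

```python
from typing import Callable, Iterable


def pass_rate(
    examples: Iterable[tuple[str, str]],
    model_fn: Callable[[str], str],  # hypothetical wrapper around your model
) -> float:
    """Fraction of (prompt, expected) pairs the model answers exactly right."""
    results = [
        model_fn(prompt).strip() == expected.strip()
        for prompt, expected in examples
    ]
    if not results:
        raise ValueError("need at least one example")
    return sum(results) / len(results)


def clears_bar(
    examples: Iterable[tuple[str, str]],
    model_fn: Callable[[str], str],
    bar: float = 0.9,  # placeholder threshold; set it from your own requirements
) -> bool:
    """True when the model's pass rate meets or exceeds the bar."""
    return pass_rate(examples, model_fn) >= bar
```

Run the same examples through the frontier API you are using today, and you get a like-for-like comparison instead of a benchmark headline.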

Author

Evgenii Arsentev

PhD · Chief Product Officer at a tech company

Source: marktechpost.com