A Team of AIs Just Beat Opus 4.8 at Coding

Sakana AI's Fugu doesn't try to be the smartest single model — it routes each task to the best one in a pool, and the combo tops Opus 4.8, GPT 5.5 and Gemini on coding tests.

↻ Published 2026-06-22◷ 4 min readEA

Evgenii Arsentev · PhD

New posts every dayFollow me on TelegramWhere the AI world lives — daily AI news + Claude Code tipsFollow →Free Claude Code courseNo upsells, no cross-sells — nothing to buy here.Start free →

Sakana AI has launched Fugu, a system that beats the best single AI models not by being a bigger model, but by coordinating several of them. On SWE Bench Pro, which measures fixing real bugs in real codebases, Fugu Ultra scores 73.7 against 69.2 for Anthropic's Opus 4.8, 58.6 for GPT 5.5 and 54.2 for Gemini 3.1 Pro. It leads the same field on TerminalBench 2.1 (82.1), LiveCodeBench (93.2) and the GPQA-Diamond science exam (95.5), and Sakana says it performs on par with Anthropic's Fable 5 and Mythos — even though neither of those models is part of Fugu's pool.

The trick is that Fugu is itself a trained model whose job is to manage other models. Faced with a request, it decides whether to answer directly or delegate to a specialized model from a swappable pool, handling the selection, the delegation, internal checks, and the final synthesis on its own. To whoever calls it, all of that is invisible: Fugu appears as a single model behind one OpenAI-compatible API. It ships in two flavors — Fugu Base, tuned for low-latency everyday work like coding and chat, and Fugu Ultra, built for multi-step problems like research, security analysis and patent searches.

Why a committee beats a soloist

The numbers aren't just lab figures. One developer reported that on a code review, Fugu Ultra surfaced more than twenty issues where GPT-5.5 flagged about three. That tracks with the core idea: different models have different blind spots, and a system that can route a task to the right specialist — and double-check the answer — catches things any one model would miss. Sakana grounds this in two of its research papers, Trinity and Conductor, presented at ICLR 2026. The company was founded by former Google researchers Llion Jones, a co-author of the original 2017 Transformer paper, and David Ha.

Both versions are available now through an API and console, with subscription and usage-based billing. Fugu can also be told to exclude specific models from its pool for compliance reasons.

Why it matters for you

There are two practical wins here. The first is quality: if a coordinated team of models reliably outscores the single best model, the ceiling on what you can get from one API call goes up without you having to juggle several tools yourself. The second is independence. Sakana pitches Fugu explicitly as 'a safeguard against single-provider dependence,' pointing at recent cases where access to a top model was cut off worldwide overnight. If your workflow leans on one provider, a system that can quietly swap in another is real insurance, not just a benchmark flex.

My take: the orchestration idea matters more than the leaderboard. We've spent two years asking which single model is best; Fugu is a bet that the better question is which combination is best — and that bet just posted numbers.

ℹWhat I'd actually do

If you rely on one AI for coding or research, it's worth testing a router like Fugu on your hardest real tasks — the kind where a single model usually misses something — and compare what it catches versus your current tool. And don't ignore the lock-in angle: having a fallback that can switch providers is worth keeping in mind even if you don't switch today.

#Sakana AI#AI models#coding#benchmarks

Related guides

Author

Evgenii Arsentev

PhD · Chief Product Officer at a tech company

About the author →

Want to actually build this?

Guides explain. The free course transforms — personalized, gamified, and built to get you shipping fast.

◉ Start the free course

← All news

Source: the-decoder.com