AI Solves Just 3% of Real Knowledge Work, Test Finds

A new benchmark, AA-Briefcase, hands AI realistic multi-week office projects. The best model, Claude Fable 5, fully solves only 3% of the 91 tasks.

Published 2026-06-19 · 4 min read

Evgenii Arsentev · PhD

The most capable AI model on the market fully completes just 3% of realistic knowledge-work projects, according to AA-Briefcase, a benchmark published on June 19, 2026 by the analysis firm Artificial Analysis. That top score belongs to Claude Fable 5; every other model tested did worse. The benchmark is built to mimic the kind of work people actually do at a desk over weeks — not the tidy, self-contained questions that most AI tests use.

Instead of a clean prompt, AA-Briefcase gives a model a mess: information fragmented across Slack threads, email chains, meeting transcripts and large data exports, the way a real multi-week project arrives. To score well, a model has to dig the relevant facts out of that pile, keep them straight over a long stretch of work, and produce something usable at the end. On 31 of the 91 tasks, no model reached even a 50% pass rate. In other words, on roughly a third of the workload (34%), even the best AI fell short of the halfway mark.

Better models fail in a sneakier way

One of the more useful findings is how the mistakes change as models improve. "The types of errors shift as models get better," the report notes. "Weaker models choke on basic execution as they miss relevant files or spit out unusable results. Stronger models fail more quietly, as they hit the obvious requirements but miss details you'd only catch by piecing together information from multiple sources." In other words, the cheap models break in ways you'll notice immediately; the expensive ones break in ways that look finished and pass a quick glance.

Cost is the other eye-opener. The benchmark put the price of a single task anywhere from $0.04 on DeepSeek V4 Flash to more than $31 on Claude Fable 5, a spread of roughly 800x. Paying roughly 800 times more buys you a better result, but not a reliable one: even at the top of the range you're getting a 3% full-completion rate. My honest read after a year of leaning on these tools daily: this matches what it feels like in practice. AI is great at the first 80% of a long task and quietly drops the details that only surface when you cross-reference everything, which is exactly the part a human still has to own.
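If you want to sanity-check the headline numbers yourself, the arithmetic fits in a few lines. This is only a rough sketch using the figures quoted in this article: the variable and function names are mine, not the benchmark's, and the top price is entered as exactly $31 even though the report says "more than $31", so the ratio it prints is a lower bound.

```python
# Figures as quoted in the article. Names are illustrative, not from the benchmark.
TOTAL_TASKS = 91
TASKS_UNDER_HALF = 31    # tasks where no model reached a 50% pass rate
CHEAPEST_USD = 0.04      # per task, DeepSeek V4 Flash
PRICIEST_USD = 31.00     # per task, Claude Fable 5 (assumed floor of "more than $31")


def share_of_tasks(count: int, total: int) -> float:
    """Fraction of the task set that `count` tasks represent."""
    return count / total


def cost_spread(high: float, low: float) -> float:
    """How many times more expensive the priciest run is than the cheapest."""
    return high / low


hard_share = share_of_tasks(TASKS_UNDER_HALF, TOTAL_TASKS)
spread = cost_spread(PRICIEST_USD, CHEAPEST_USD)

print(f"Tasks where no model reached 50%: {hard_share:.0%}")
print(f"Cost spread, lower bound: {spread:.0f}x")
```

At exactly $31 the spread comes out a little under 800x, which is why the 800x figure is best read as a round number for a top price somewhat above $31.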

What I'd actually do

Don't hand AI a fuzzy, multi-source project and trust the output because it reads well — that's precisely where the strong models fail "quietly." Break long tasks into checkable chunks, feed sources in deliberately rather than dumping everything, and always verify the synthesized conclusions against the originals yourself. The benchmark's message in one line: AI is a fast first-drafter, not a finisher you can stop checking.

Author

Evgenii Arsentev

PhD · Chief Product Officer at a tech company

Source: the-decoder.com