Most AIs Go Broke Running a 500-Day Startup

Princeton's CEO-Bench ran 14 AI models as startup CEOs for 500 days. Only 3 turned a profit. A simple no-AI rule beat 11 of them. Claude Fable 5 won.

Published 2026-06-29 · 5 min read

Evgenii Arsentev · PhD


Princeton University researchers handed control of a fictional software company to 14 of today's best AI models and ran the clock for 500 simulated days. Each model started with $1 million and zero customers and had to make real business decisions through a programming interface — set prices, buy advertising, manage product quality, handle customer support, respond to what competitors were doing. The goal was simple: don't go broke. Most of them did.

Only three finished with more money than they started with. Claude Fable 5 led by a wide margin, turning $1 million into $47.15 million over the simulation. Claude Opus 4.8 finished at $27.8 million. GPT-5.5 reached $21.3 million. The remaining eleven went bankrupt before the 500 days were up.

The result that stings

The benchmark's most revealing number isn't which model won — it's that a simple rule-based algorithm with no AI at all finished at $15.76 million and beat 11 of the 14 models. The rule used fixed prices, fixed quotas, and focused targeting. No reasoning, no planning, no language model. Just a deterministic set of if-then instructions. And it outperformed nearly every AI in the test.
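To make "a deterministic set of if-then instructions" concrete, here is a minimal sketch of what a rule-based policy of this kind could look like. It is not the baseline used in CEO-Bench: the `Observation` and `Action` types, the thresholds, and every number below are hypothetical, chosen only to illustrate fixed prices and fixed quotas with no learning or planning.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """What the policy sees each day (hypothetical fields)."""
    day: int
    cash: float
    open_tickets: int


@dataclass(frozen=True)
class Action:
    """What the policy decides each day (hypothetical fields)."""
    price: float
    ad_budget: float
    support_hires: int


# Illustrative constants, not values from the benchmark.
FIXED_PRICE = 99.0
DAILY_AD_QUOTA = 2_000.0
CASH_FLOOR = 250_000.0
TICKET_LIMIT = 50


def rule_policy(obs: Observation) -> Action:
    """Same input always yields the same output: no model, no memory."""
    # Rule 1: the price never changes.
    price = FIXED_PRICE
    # Rule 2: spend the fixed ad quota unless cash is below the floor.
    ad_budget = DAILY_AD_QUOTA if obs.cash >= CASH_FLOOR else 0.0
    # Rule 3: add one support hire only when the ticket backlog is too large.
    support_hires = 1 if obs.open_tickets > TICKET_LIMIT else 0
    return Action(price=price, ad_budget=ad_budget, support_hires=support_hires)
```

A policy like this cannot adapt, but it also cannot drift: it makes the same choice on day 400 that it made on day 4, which is the consistency the article says most models lacked.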

That result isn't a gotcha — it's a useful diagnostic. The task requires what the Princeton researchers call 'steering intelligence': the ability to coordinate decisions across months where a price change in week five affects customer retention in week thirty, and where competitors react to your moves in ways you can't predict. Current AI models are excellent at analyzing a single situation, generating a targeted plan, or writing a precise message. They're weaker at holding a coherent strategy across hundreds of decision cycles where conditions keep shifting.

The three models that succeeded had something in common. They explored new strategies rather than defaulting to cost-cutting. They inferred hidden information — in the simulation, customer satisfaction isn't visible directly; you have to read it from indirect signals and adapt. They predicted cash flow trends before problems arrived rather than reacting after the fact. And they adjusted quickly when competitors moved. These are behaviors that require maintaining a model of the whole situation over time, not just answering the next question well.
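One of those behaviors, predicting cash flow trouble before it arrives, can be illustrated with a simple runway estimate. This is a rough sketch under stated assumptions: the function name, the 14-day window, and the straight-line extrapolation are all illustrative choices, not anything described by the benchmark authors.

```python
from typing import Optional, Sequence


def days_of_runway(cash_history: Sequence[float], window: int = 14) -> Optional[float]:
    """Estimate days until cash hits zero from the recent average daily change.

    Returns None when there is too little data or cash is not shrinking.
    """
    recent = list(cash_history[-window:])
    if len(recent) < 2:
        return None
    # Average change per day across the window (negative means burning cash).
    daily_change = (recent[-1] - recent[0]) / (len(recent) - 1)
    if daily_change >= 0:
        return None
    return recent[-1] / -daily_change


# Cash falls by 100 per day, so a balance of 800 lasts 8 more days.
runway = days_of_runway([1000.0, 900.0, 800.0])
```

An agent that tracks a number like this can cut spending while it still has room to maneuver, instead of reacting once the balance is already near zero.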

The benchmark, called CEO-Bench, simulates a company called NovaMind with realistic delays built in: revenue arrives at billing dates, R&D investments take weeks to pay off, hidden customer satisfaction scores drift based on decisions you made earlier. It's designed to expose exactly the skill that doesn't show up on typical AI benchmarks — sustained strategic coherence under pressure.
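The two delay mechanisms named above, revenue that arrives at a later billing date and a hidden satisfaction score that drifts with earlier decisions, can be shown in a toy model. This is not the CEO-Bench simulator: the function, the 30-day billing delay, the reference price of 100, and the drift coefficient are all assumptions made up for illustration.

```python
from collections import deque
from typing import List, Sequence, Tuple


def simulate(prices: Sequence[float], billing_delay: int = 30,
             customers: int = 100) -> Tuple[List[float], float]:
    """Toy model of delayed revenue and hidden satisfaction drift.

    Returns the cash received each day and the final hidden satisfaction.
    """
    # Revenue booked today sits in this queue for billing_delay days.
    pending = deque([0.0] * billing_delay)
    satisfaction = 0.8  # hidden from the decision-maker in this toy setup
    cash_received: List[float] = []
    for price in prices:
        # Pricing above the reference of 100 erodes satisfaction; below restores it.
        satisfaction = min(1.0, max(0.0, satisfaction - 0.0005 * (price - 100.0)))
        # Today's booking depends on satisfaction shaped by all earlier prices.
        pending.append(customers * satisfaction * price)
        # The cash that actually arrives today was booked billing_delay days ago.
        cash_received.append(pending.popleft())
    return cash_received, satisfaction
```

In this sketch a price increase shows no effect on cash for 30 days, while satisfaction begins falling immediately and out of sight. A decision-maker that only reacts to today's cash receipts would see nothing wrong until the damage is a month old.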

What I'd actually do

If you're building AI agents that run multi-step workflows, the CEO-Bench result is a useful frame. The failure mode isn't bad individual decisions; it's losing the strategic thread across many decisions as conditions change. For anything consequential, build in regular checkpoints where the agent explicitly checks whether its current action still serves the original goal. And run your agents through long-horizon tests, spanning many decision cycles, before trusting them in production.
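A periodic goal check can be as simple as a wrapper around the agent loop. The sketch below is one possible shape, not a prescribed implementation: `act` and `still_on_goal` are hypothetical callables you would supply, and the `replan@N` marker stands in for whatever re-planning step your agent framework provides.

```python
from typing import Callable, List


def run_with_checkpoints(steps: int,
                         act: Callable[[int], str],
                         still_on_goal: Callable[[List[str]], bool],
                         check_every: int = 10) -> List[str]:
    """Run an agent loop, pausing every check_every steps to re-check the goal."""
    history: List[str] = []
    for step in range(1, steps + 1):
        history.append(act(step))
        # Checkpoint: compare recent behavior against the original goal.
        if step % check_every == 0 and not still_on_goal(history):
            # Placeholder for a real re-planning call.
            history.append(f"replan@{step}")
    return history


# Example: an agent stuck on cost-cutting gets flagged at each checkpoint.
log = run_with_checkpoints(
    steps=20,
    act=lambda step: "cut_costs",
    still_on_goal=lambda h: h[-3:] != ["cut_costs"] * 3,
)
```

The check itself can be a second model call, a metric threshold, or a human review. The point is that it runs on a schedule rather than only when something has already gone wrong.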


Author

Evgenii Arsentev

PhD · Chief Product Officer at a tech company


Source: the-decoder.com