DeepMind Treats Its Own AI Agents as Insider Threats

Google DeepMind's new AI Control Roadmap treats the company's internal AI agents as potential insider threats, drawing on an analysis of roughly a million coding-agent runs.

Published 2026-06-18 · 5 min read

Evgenii Arsentev · PhD

Google DeepMind published an AI Control Roadmap, a framework for building and managing the advanced AI it deploys inside Google — and the core idea is blunt: treat your own internal AI agents as potentially untrusted insiders. Instead of assuming an agent is aligned and safe, the approach assumes it might not be, and builds layers of defense that still hold even when alignment is imperfect.

This is a notable shift in tone from a frontier lab. A lot of AI-safety talk is about training models to want the right things. DeepMind's roadmap is about what you do when you can't fully count on that — borrowing the mindset of corporate security teams, who already assume that some insiders or compromised accounts will misbehave and design controls around that reality.

How the control approach works

The roadmap leans on three pieces. First, threat modeling: DeepMind built a framework that treats untrusted AI agents like potential insider threats, extending the industry-standard MITRE ATT&CK catalog so teams can track risks, spot bad behavior early and run safety drills. Second, detection and response: trusted AI systems act as supervisors that watch an agent's reasoning and actions and can block a harmful action before it executes. Third, hard metrics: coverage (how much agent traffic is actually monitored), recall (how many misaligned behaviors get caught) and response time. Underneath all of it sit the classic defenses of sandboxing, endpoint security and resistance to prompt injection, layered with model-alignment training.
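To make the three metrics concrete, here is a minimal sketch of how they could be computed from a log of agent events. It is an illustration, not DeepMind's implementation: the `AgentEvent` record, its field names, and the choice to measure response time from flag to response are all assumptions of mine.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AgentEvent:
    """One logged agent action (hypothetical schema, not from the roadmap)."""
    monitored: bool                           # did a supervisor see this event at all?
    misaligned: bool                          # ground-truth label, e.g. from a later audit
    flagged: bool                             # did the monitor flag it?
    response_seconds: Optional[float] = None  # assumed: time from flag to response


def coverage(events: List[AgentEvent]) -> Optional[float]:
    """Share of all agent events that were monitored."""
    if not events:
        return None
    return sum(e.monitored for e in events) / len(events)


def recall(events: List[AgentEvent]) -> Optional[float]:
    """Share of truly misaligned events that the monitor flagged."""
    bad = [e for e in events if e.misaligned]
    if not bad:
        return None
    return sum(e.flagged for e in bad) / len(bad)


def mean_response_time(events: List[AgentEvent]) -> Optional[float]:
    """Average seconds from flag to response, over flagged events with a timing."""
    times = [e.response_seconds for e in events
             if e.flagged and e.response_seconds is not None]
    return sum(times) / len(times) if times else None


events = [
    AgentEvent(monitored=True, misaligned=True, flagged=True, response_seconds=2.0),
    AgentEvent(monitored=True, misaligned=False, flagged=False),
    AgentEvent(monitored=False, misaligned=True, flagged=False),  # missed: never monitored
    AgentEvent(monitored=True, misaligned=False, flagged=True, response_seconds=4.0),
]
print(coverage(events), recall(events), mean_response_time(events))
```

Note how the unmonitored event drags down both coverage and recall: a monitor cannot catch what it never sees, which is presumably why coverage is tracked as its own number.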

The most useful finding: most 'bad' behavior isn't malicious

To build the monitoring, DeepMind analyzed roughly one million coding-agent trajectories — real runs of AI agents doing work. The standout result: the majority of flagged events did not come from adversarial intent. They came from the agent misinterpreting the task or being overeager to achieve the user's goal. In plain terms, the agent wasn't scheming; it was trying too hard to be helpful and did the wrong thing in the process.

That matters far beyond Google. If you've used an AI coding agent or any tool that takes actions for you, this names the real day-to-day risk. It's rarely a rogue AI plotting against you. It's a well-meaning agent that misreads what you asked, runs with it, and creates a mess — deleting the wrong files, over-fulfilling a request, taking a shortcut you never wanted. DeepMind's data says that's the common failure mode, and it's exactly the kind that confident-sounding output hides.

What I'd actually do

Treat your own AI agents the way DeepMind treats theirs: assume good intentions, but don't grant blind trust. Give an agent the narrowest permissions that still let it work, keep a human in the loop for anything destructive or irreversible, and read what it actually did, not just its cheerful summary. The failure mode to plan for is overeagerness, not malice.
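As a rough sketch of that advice, the gate below combines an allowlist (narrow permissions), a human-approval callback for destructive actions, and a log of every decision so you can read what the agent actually tried to do. Everything here is hypothetical: the `ActionGate` class, the action names and the `approve` callback are mine and do not correspond to any real agent framework.

```python
from dataclasses import dataclass, field
from typing import Callable, FrozenSet, List, Tuple

# Hypothetical names for actions that are destructive or hard to undo.
DESTRUCTIVE = frozenset({"delete_file", "force_push", "drop_table"})


@dataclass
class ActionGate:
    allowed: FrozenSet[str]               # narrowest set of actions the agent may take
    approve: Callable[[str, str], bool]   # human-in-the-loop callback: (action, target) -> ok?
    log: List[Tuple[str, str, str]] = field(default_factory=list)

    def request(self, action: str, target: str) -> bool:
        """Decide whether the agent may run `action` on `target`, and record the decision."""
        if action not in self.allowed:
            decision = "denied"           # outside the allowlist: never runs
        elif action in DESTRUCTIVE and not self.approve(action, target):
            decision = "rejected"         # allowed in principle, but the human said no
        else:
            decision = "allowed"
        self.log.append((action, target, decision))
        return decision == "allowed"


# Usage: a reviewer who declines every destructive request.
gate = ActionGate(
    allowed=frozenset({"read_file", "write_file", "delete_file"}),
    approve=lambda action, target: False,
)
gate.request("read_file", "app.py")      # allowed
gate.request("delete_file", "app.py")    # rejected by the human
gate.request("force_push", "main")       # denied: not on the allowlist
for entry in gate.log:
    print(entry)
```

The log is the important part: it records attempts that were blocked as well as actions that ran, which is the record to read instead of the agent's own summary.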

Author

Evgenii Arsentev

PhD · Chief Product Officer at a tech company

Source: deepmind.google