GPT-5.5-Cyber Edges Anthropic's Mythos at Security
OpenAI's new GPT-5.5-Cyber scores 85.6% on CyberGym, topping Anthropic's Mythos 5, while its Codex Security plugin has already scanned 30M+ commits for vulnerabilities.
Evgenii Arsentev · PhD
85.6% against 83.8% on CyberGym: that's the headline number, but the more interesting figure is the jump GPT-5.5-Cyber makes over OpenAI's own base model. On ExploitGym, a test of real vulnerability-finding capability, the specialized model hits 39.5% versus 25.95% for base GPT-5.5. On SEC-bench Pro, it scores 69.8% versus 63.1%. OpenAI released GPT-5.5-Cyber today as the centerpiece of its expanded Daybreak cybersecurity program, and the benchmark gap makes a simple argument: a model trained specifically on cybersecurity data performs meaningfully, not just marginally, better on security tasks.
The Codex Security plugin that ships alongside the model has already been running at scale. It has scanned more than 30 million commits across 30,000+ codebases, automatically flagging 500,000 findings as addressed, with another 70,000 confirmed by human reviewers. One guardrail that stays in place: every patch the model proposes still requires a person to review and approve before it is merged. The AI spots the problem and drafts the fix; a human decides whether it ships.
Who's already in the scanning net
The partner ecosystem is substantial: 25+ security companies including Cisco, CrowdStrike, Cloudflare, Palo Alto Networks, IBM, Fortinet, Wiz, SentinelOne and Palantir, plus eight governments — Australia, Canada, France, Germany, Japan, South Korea, the UK and the EU's ENISA. On the open source side, the projects signed on to Patch the Planet include cURL, Go, Python and Sigstore. These are the libraries that sit underneath most modern apps and APIs, which means vulnerabilities in code you didn't write but rely on are increasingly in scope.
What the specialization gap means for builders
The roughly 13.5-point gap between GPT-5.5-Cyber and base GPT-5.5 on ExploitGym (39.5% versus 25.95%) is evidence for a pattern that holds across domains: when a capable base model gets trained specifically on a well-defined task, the performance lift is real and substantial. For anyone building tools in specialized fields (security, medicine, law, finance), this is an argument for domain-specific models rather than just trying harder prompts on a general one.
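To make the size of the gap concrete, here is a small sketch that computes the absolute gap (in percentage points) and the relative lift from the scores quoted in this article. The dictionary layout and the names `SCORES`, `gap` and `GAPS` are illustrative assumptions, not anything published by OpenAI.

```python
# Benchmark scores as reported in this article, in percent.
# The structure and names here are illustrative only.
SCORES = {
    "ExploitGym": {"cyber": 39.5, "base": 25.95},
    "SEC-bench Pro": {"cyber": 69.8, "base": 63.1},
}


def gap(cyber: float, base: float) -> tuple:
    """Return (absolute gap in percentage points, relative lift in percent)."""
    absolute = cyber - base
    relative = absolute / base * 100
    return round(absolute, 2), round(relative, 1)


# Gap for each benchmark: specialized model versus base model.
GAPS = {name: gap(s["cyber"], s["base"]) for name, s in SCORES.items()}

if __name__ == "__main__":
    for name, (points, lift) in GAPS.items():
        print(f"{name}: +{points} points, {lift}% relative lift")
```

On these numbers the ExploitGym gap is 13.55 points, a relative lift of about 52% over the base model, while the SEC-bench Pro gap is 6.7 points, or about 11%.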
There's also a practical infrastructure implication. Millions of repositories are being actively scanned. Python, cURL and Go — languages and libraries almost every modern project touches — are inside that net. The bet OpenAI is making is that automated scanning at this scale, with human oversight on every patch, can close the gap between the rate vulnerabilities are introduced and the rate they're found and fixed.
If you have a public repository or an app that other people log into, it's worth running Codex Security over your codebase — or, more practically, making a dedicated AI request to 'find injection vulnerabilities and hardcoded credentials' as a separate pass before pushing any significant changes. Security isn't a one-time audit at launch; it's the check you run every time something important changes.
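One way to make that separate pass a habit is to generate the request from your staged changes. The sketch below is a minimal example under stated assumptions: it uses the standard `git diff --cached` command to collect the diff, and the names `SECURITY_INSTRUCTION`, `staged_diff` and `build_security_prompt` are hypothetical. It only builds the prompt text; which assistant or tool you send it to is left open.

```python
import subprocess

# Fixed instruction for the dedicated security pass (wording is illustrative).
SECURITY_INSTRUCTION = (
    "Review the following diff only for security problems. "
    "Find injection vulnerabilities and hardcoded credentials. "
    "For each finding, give the file, the line, and a suggested fix."
)


def staged_diff() -> str:
    """Return the diff of staged changes via `git diff --cached`."""
    result = subprocess.run(
        ["git", "diff", "--cached"],
        capture_output=True,
        text=True,
        check=True,  # raise if git fails, e.g. outside a repository
    )
    return result.stdout


def build_security_prompt(diff: str, max_chars: int = 20000) -> str:
    """Combine the fixed instruction with a (possibly truncated) diff."""
    if not diff.strip():
        return ""  # nothing staged, nothing to review
    body = diff[:max_chars]
    note = "\n[diff truncated]" if len(diff) > max_chars else ""
    return f"{SECURITY_INSTRUCTION}\n\n{body}{note}"
```

Calling `build_security_prompt(staged_diff())` from inside a repository returns the text to paste into your AI tool of choice, or an empty string when nothing is staged. Any fix the tool proposes should still get the same human review the article describes for Codex Security patches.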
Author
Evgenii Arsentev
PhD · Chief Product Officer at a tech company
Source: the-decoder.com