A Little Good Training Makes AI Broadly Harder to Trick

OpenAI found small doses of good behavior in training make models broadly safer: 44 of 53 benchmarks improved, with more resistance to manipulation by users.

Published 2026-06-19 · 4 min read

Evgenii Arsentev · PhD

OpenAI researchers report that adding a small amount of "beneficial trait" data to a model's standard reinforcement-learning post-training makes it broadly safer and noticeably harder to manipulate. Across 53 independent benchmarks measuring deception, honesty, sycophancy and reward hacking, 44 improved. The effect, covered by The Decoder, is notable because it generalizes far beyond the narrow situations the extra training actually covered.

The method is almost mundane. Researchers wrote realistic conversations that reinforce specific desirable traits — truthfulness, admitting uncertainty, corrigibility (willingness to be corrected), transparency in reasoning, fairness, and concern for human well-being — across domains like healthcare, education, science, law and engineering. Then they mixed a small portion of that data into the normal training pipeline. No new architecture, no exotic technique: just deliberately seeding good habits.
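To make the "mix a small portion in" step concrete, here is a minimal sketch of blending trait examples into a larger training set. It is an illustration only, not OpenAI's pipeline: the function name `mix_in_trait_data`, the 5% default for `trait_fraction`, and the flat list of examples are all my assumptions. The source only says "a small portion", and the real setup is reinforcement-learning post-training rather than a shuffled list.

```python
import random


def mix_in_trait_data(base_examples, trait_examples, trait_fraction=0.05, seed=0):
    """Return a shuffled list in which about `trait_fraction` of items are trait examples.

    `trait_fraction` is a hypothetical knob; the article gives no actual ratio.
    """
    if not 0 <= trait_fraction < 1:
        raise ValueError("trait_fraction must be in [0, 1)")

    rng = random.Random(seed)

    # Solve n_trait / (n_base + n_trait) = trait_fraction for n_trait.
    n_trait = round(len(base_examples) * trait_fraction / (1 - trait_fraction))
    # Never ask for more trait examples than we actually have.
    n_trait = min(n_trait, len(trait_examples))

    sampled_traits = rng.sample(list(trait_examples), n_trait)
    mixed = list(base_examples) + sampled_traits
    rng.shuffle(mixed)
    return mixed


# Toy data: 95 ordinary training items and a pool of 20 trait conversations.
base = [f"base-{i}" for i in range(95)]
traits = [f"trait-{i}" for i in range(20)]

mixed = mix_in_trait_data(base, traits, trait_fraction=0.05)
trait_count = sum(1 for item in mixed if item.startswith("trait-"))
print(len(mixed), trait_count)
```

The point of the sketch is how little changes: the base data stays intact, and the trait conversations are a thin slice shuffled in alongside it.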

The part that surprised me

What makes this interesting is cross-domain transfer. A model trained to behave well on health scenarios also did better on non-health evaluations, and training on non-health material improved its health answers. Good behavior spread sideways. That mirrors an earlier, darker finding — that training a model to be bad in one narrow area made it broadly worse everywhere — and flips it into a positive. The same generalization that made misbehavior contagious also makes good behavior contagious.

The trained models also held up under pressure. Adversarial prompts had, in the researchers' words, "far less effect" than on baseline versions, and the models resisted harmful fine-tuning attempts while still following legitimate instructions — a property the team calls selective persistence. In plain terms: harder to jailbreak into doing something bad, but still happy to do what you actually asked.

What I'd actually do

If you build on top of these models, this is quiet good news you can lean on: newer models are getting genuinely tougher to social-engineer into bad output. But don't treat it as a force field — 44 of 53 means 9 benchmarks didn't improve. Keep your own guardrails on user input; model-level safety is a layer, not the whole wall.
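For a sense of scale, the arithmetic behind that caveat is quick to check. The two counts come from the article; the variable names are mine, and "did not improve" here only means what the headline numbers imply, not that those benchmarks got worse.

```python
improved = 44  # benchmarks that improved, per the article
total = 53     # benchmarks evaluated, per the article

not_improved = total - improved
share_improved = improved / total
share_not_improved = not_improved / total

print(f"improved: {improved}/{total} = {share_improved:.1%}")
print(f"did not improve: {not_improved}/{total} = {share_not_improved:.1%}")
# improved: 44/53 = 83.0%
# did not improve: 9/53 = 17.0%
```

Roughly one benchmark in six showed no gain, which is why I'd keep my own input checks in place.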

Why this matters for a regular user: a lot of AI risk isn't the model being evil, it's the model being gullible — flattering you into a bad decision, caving to a manipulative prompt, confidently making things up. Showing that a modest, cheap dose of 'be honest, admit what you don't know, accept correction' transfers across the whole model is a practical path to fewer of those failures. OpenAI frames its traits as empirically measurable behaviors, a contrast with Anthropic's approach of writing an explicit constitution; both are circling the same goal from different angles. The takeaway I'd keep: making AI safer here looked less like a heroic breakthrough and more like good habits, applied early, spreading on their own.

#openai #ai-safety #alignment

Author

Evgenii Arsentev

PhD · Chief Product Officer at a tech company

Source: the-decoder.com