Bayer Trusts AI Agents With Decades of Drug Research

A Thoughtworks teardown of PRINCE, built with Bayer, shows reliable agentic AI comes from engineering the harness around the model, not from a smarter model.

Published 2026-06-21 · 5 min read

Evgenii Arsentev · PhD

A detailed engineering account of a production AI system at Bayer argues that reliability in agentic AI comes mostly from disciplined engineering, not from a better model. Written by Sarang Sanjay Kulkarni, a principal consultant at Thoughtworks, the piece walks through PRINCE — a Preclinical Information Center built with Bayer to help researchers navigate decades of safety-study reports. It's a rare look at what an agent system actually requires once it has to work for real users, every day, on data that matters.

PRINCE grew through three phases that map neatly onto how most teams adopt agentic AI. First Search: metadata filtering over the reports. Then Ask: retrieval-augmented generation so researchers could ask questions in plain language. Then Do: multiple specialized agents that plan, research, reflect and write to carry out multi-step tasks. Each phase added capability, and each added new ways to fail.

The harness matters more than the model

The most quotable line is also the thesis: 'reliability comes from engineering both the context the model sees and the harness within which the model acts.' Two ideas carry the weight. Context engineering means deliberately routing different information to different agents at different stages — planning context for the planner, retrieval context for the researcher, evidence context for the reflection step, synthesis context for the writer — instead of stuffing everything into one giant prompt and hoping. Harness engineering is the scaffolding around the model: orchestration, tool boundaries, state persistence, retries, fallbacks, validation, reflection loops, observability and human review.
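To make the context-engineering idea concrete, here is a minimal sketch of stage-specific routing. It is not PRINCE's code: the `WorkflowState` fields, the `STAGE_FIELDS` mapping, the stage names and the sample question are all assumptions made up for illustration. The point is only that each stage receives a deliberately narrow slice of the shared state.

```python
from dataclasses import dataclass, field


@dataclass
class WorkflowState:
    """Shared state for one run (hypothetical structure, not PRINCE's)."""
    question: str
    plan: list[str] = field(default_factory=list)
    retrieved_chunks: list[str] = field(default_factory=list)
    draft: str = ""


# Assumed mapping: which state fields each stage is allowed to see.
STAGE_FIELDS = {
    "planner": ["question"],
    "researcher": ["question", "plan"],
    "reflector": ["question", "retrieved_chunks"],
    "writer": ["question", "plan", "retrieved_chunks"],
}


def build_context(stage: str, state: WorkflowState) -> dict:
    """Return only the fields the named stage needs, nothing else."""
    return {name: getattr(state, name) for name in STAGE_FIELDS[stage]}


# Illustrative data only.
state = WorkflowState(
    question="example question",
    plan=["find relevant reports", "summarise findings"],
    retrieved_chunks=["chunk A", "chunk B"],
)
planner_context = build_context("planner", state)      # question only
reflector_context = build_context("reflector", state)  # question + evidence
```

Keeping the mapping in one table makes it easy to audit what each agent can see, which is most of the value of the technique.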

The system uses three distinct kinds of reflection: process reflection during planning, data reflection to check whether retrieved evidence is actually sufficient, and draft reflection to confirm the final answer is complete. For resilience, state lives in PostgreSQL, retries fire automatically at both the model and node level, users can re-run from the exact point of failure, and the system falls back across LLM providers when one stumbles.
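The retry-and-fallback behaviour can be sketched in a few lines. This is an assumption-heavy illustration, not the PRINCE implementation: `call_with_fallback` is a hypothetical helper, each provider is modelled as a plain function from prompt to text, and the default of three attempts per provider is my choice, not a figure from the article. Backoff delays and state persistence are left out to keep the sketch short.

```python
from typing import Callable, Optional, Sequence


def call_with_fallback(
    prompt: str,
    providers: Sequence[Callable[[str], str]],
    max_attempts: int = 3,
) -> str:
    """Try each provider in order, retrying each one before moving on."""
    last_error: Optional[Exception] = None
    for provider in providers:
        for _ in range(max_attempts):
            try:
                return provider(prompt)
            except Exception as exc:  # a real harness would catch narrower errors
                last_error = exc
    # Every provider failed on every attempt.
    raise RuntimeError("all providers failed") from last_error
```

A caller would pass the primary provider first and the backups after it, so a failing primary is retried a few times before the next one is tried.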

The RAG pipeline is just as concrete: keyword extraction plus metadata filtering, query expansion into five semantic variants, weighted hybrid search at 0.7 semantic and 0.3 keyword, then cross-encoder reranking that narrows roughly 20 retrieved chunks down to the best 7. SQL queries get up to three retries; record fetches are capped at 50 per query. None of this is glamorous, and that's the point.
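The scoring and narrowing steps of that pipeline might look roughly like this. The 0.7 and 0.3 weights and the 20-to-7 narrowing come from the article; everything else is an assumption: candidates are plain dicts with `semantic` and `keyword` scores already normalised to the same range, and `rerank_score` is a stand-in for whatever cross-encoder the real system uses.

```python
from typing import Callable

SEMANTIC_WEIGHT = 0.7  # from the article
KEYWORD_WEIGHT = 0.3   # from the article


def hybrid_score(semantic: float, keyword: float) -> float:
    """Weighted sum of the two scores (assumes both share one scale)."""
    return SEMANTIC_WEIGHT * semantic + KEYWORD_WEIGHT * keyword


def retrieve(
    candidates: list[dict],
    rerank_score: Callable[[dict], float],
    retrieve_k: int = 20,
    final_k: int = 7,
) -> list[dict]:
    """Shortlist by hybrid score, then rerank and keep the best few."""
    by_hybrid = sorted(
        candidates,
        key=lambda c: hybrid_score(c["semantic"], c["keyword"]),
        reverse=True,
    )
    shortlist = by_hybrid[:retrieve_k]
    # rerank_score is a placeholder for a cross-encoder relevance score.
    reranked = sorted(shortlist, key=rerank_score, reverse=True)
    return reranked[:final_k]
```

The two-stage shape is the design choice worth copying: a cheap score picks the shortlist, and the expensive reranker only ever sees about 20 chunks.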

Why it matters for you

If you're trying to ship something with AI agents and it keeps almost working, this is the playbook for the last mile. The lesson is that the gap between a slick demo and a trustworthy product is rarely closed by upgrading the model — it's closed by controlling what each step sees and building workflows you can observe, retry and recover. That's encouraging, because it's engineering you can do, not a frontier model you have to wait for.

My take: the most useful habit hiding in here is splitting one bloated prompt into stage-specific context. I've watched my own agent setups get worse as I crammed more into a single instruction; giving each step only what it needs is the cheapest reliability win available, and you can apply it today with whatever model you already use.

What I'd actually do

Take one flaky agent workflow you have and do two things: give each stage its own narrow context instead of one mega-prompt, and add a reflection step that checks whether the retrieved evidence is actually enough before answering. Those two changes mirror what carried PRINCE to production — and neither requires a new model.
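The second change, a data-reflection gate, can be sketched as a bounded loop. This is my own minimal version under stated assumptions, not how PRINCE does it: `retrieve`, `is_sufficient` and `write` are hypothetical callables you would back with your own retriever and model calls, and the limit of three rounds and the fallback message are arbitrary choices.

```python
from typing import Callable


def answer_with_data_reflection(
    question: str,
    retrieve: Callable[[str, int], list[str]],
    is_sufficient: Callable[[str, list[str]], bool],
    write: Callable[[str, list[str]], str],
    max_rounds: int = 3,
) -> str:
    """Only write an answer once the evidence check passes."""
    evidence: list[str] = []
    for round_number in range(max_rounds):
        # Gather more evidence each round (hypothetical retriever).
        evidence.extend(retrieve(question, round_number))
        if is_sufficient(question, evidence):
            return write(question, evidence)
    # Assumed fallback: say so rather than answer on thin evidence.
    return "Insufficient evidence to answer: " + question
```

The loop is bounded on purpose: a reflection step that can ask for more evidence forever is just another way to fail.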

Author

Evgenii Arsentev

PhD · Chief Product Officer at a tech company

Source: martinfowler.com