AI Coding Agents Find the File, Miss the Lines
A new benchmark, SWE-Explore, finds AI coding agents pick the right file but catch only 14–19% of the exact lines that actually need fixing.
Evgenii Arsentev · PhDAI coding agents reliably find the right file to fix but catch only 14–19% of the specific lines that actually matter, according to SWE-Explore, a new benchmark from an international team including Shanghai Jiao Tong University. The study is the first to measure code search on its own — separate from whether the final patch works — across 848 problems in 203 open-source projects and ten programming languages.
The split is the interesting part. Ask an agent like Claude Code, Codex or OpenHands to locate a bug, and it ranks the correct file near the top almost every time. Zoom in to the exact lines that need changing, and accuracy collapses. Keyword search, the study notes, barely beats random guessing at that fine-grained level — and, tellingly, a stronger underlying model doesn't close the gap. The weakness is in how agents explore code, not in raw intelligence.
Why the missing lines break the fix
The researchers found a threshold: a repair tends to succeed only when the agent has at least 50–75% of the relevant code regions in view. Below that, even a capable model writes a confident, wrong patch — it's solving a problem it can only half-see. They also found that missing the right context hurts more than drowning in irrelevant context, which cuts against the instinct to keep an agent's input lean. To set ground truth, the team used verified solutions from GPT-5.4, Gemini 3 Pro, Claude Sonnet 4.6 and Kimi K2.6.
If you use AI to fix code, this explains a frustration you've probably felt: the agent sounds sure, edits the obvious file, and the bug survives. It wasn't being lazy — it never saw the lines where the real problem lived. The practical lesson is that the bottleneck has moved. As models get smarter, the limiting factor isn't reasoning; it's whether the agent gathered the right context before it started typing.
This matches what I see in daily use: the fixes that fail tend to fail because the model guessed at a small slice of the codebase instead of reading enough of it first. The encouraging part is that it's addressable — you can feed the context the agent won't find on its own.
When an AI fix doesn't take, stop re-prompting the same way and instead hand it the context it's missing: paste the related files, the failing test, and the function that calls the broken one. Ask it to explain where the bug lives before it writes a patch — if the explanation is vague, it hasn't found the right lines yet, and any fix will be a guess. Treat the agent as a fast junior who's brilliant once it's looking at the right place, and blind until then.

Author
Evgenii Arsentev
PhD · Chief Product Officer at a healthtech company
Want to actually build this?
Guides explain. The free course transforms — personalized, gamified, and built to get you shipping fast.
◉ Start the free courseSource: the-decoder.com