The uncomfortable thing about coding agents is that they can look brilliant right up until the project starts behaving like a real project.
Give an agent a loose greenfield task and it can often produce something that works. A route responds. A test passes. A demo appears. Everyone relaxes too early.
Then the boring constraints arrive.
Use this architecture. Keep this ORM pattern. Respect this database model. Do not invent a new folder structure. Do not bypass the existing abstraction. Keep the API contract stable. Add the feature inside the shape of the system that already exists.
That is where the magic starts to wobble.
The paper names the failure mode
A new arXiv paper called “Constraint Decay: The Fragility of LLM Agents in Backend Code Generation” studies exactly this problem. The authors evaluate agents across 80 greenfield backend generation tasks and 20 feature-implementation tasks, spanning eight web frameworks, with a unified API contract and both behavioral tests and static verifiers.
The important result is not that agents fail sometimes. Everyone who actually uses them knows that.
The important result is that failure increases as structural requirements accumulate. The paper says capable configurations lose about 30 points on average in assertion pass rate between the loose baseline and the fully specified tasks, while weaker configurations can fall close to zero.
That is a serious warning because production software is mostly accumulated constraint.
Real systems are not blank files waiting for a clever answer. They are history, naming conventions, migration rules, logging expectations, error paths, permissions, data models, and weird decisions made three years ago because somebody had to ship on a Friday.
Agents do not just need to write code. They need to preserve the shape of the system while changing it.
Frameworks change the difficulty
One of the useful details in the paper is framework sensitivity. Agents do better in minimal, explicit frameworks like Flask and worse on average in convention-heavy environments like FastAPI and Django.
That tracks with lived experience.
The more a framework depends on implicit structure, conventions, generated behavior, and ecosystem-specific taste, the harder it becomes for the model to know whether a solution is merely plausible or actually native to the project.
This is why coding-agent benchmarks can be misleading. A task can be functionally correct and still be architecturally wrong. It can pass a happy-path test while quietly corrupting the data layer. It can satisfy the prompt and still leave a human maintainer with a mess.
The paper’s error analysis points directly at that: data-layer defects, incorrect query composition, and ORM runtime violations show up as leading root causes.
That is not a small footnote. The data layer is where backend code stops being text and starts being business reality.
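To make "incorrect query composition" concrete, here is a minimal sketch using Python's built-in sqlite3 module. The tickets table, the function names, and the bug itself are invented for illustration and are not taken from the paper. The buggy query looks plausible and passes a single-owner happy-path test, but because AND binds tighter than OR it leaks another owner's rows as soon as the data gets realistic.

```python
import sqlite3


def make_db(rows):
    """Build an in-memory database with a hypothetical tickets table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tickets (id INTEGER PRIMARY KEY, owner_id INTEGER, status TEXT)"
    )
    conn.executemany("INSERT INTO tickets (owner_id, status) VALUES (?, ?)", rows)
    return conn


def active_tickets_buggy(conn, owner_id):
    # Bug: parsed as (owner_id = ? AND status = 'open') OR status = 'pending',
    # so pending tickets from every owner are returned.
    sql = (
        "SELECT id FROM tickets "
        "WHERE owner_id = ? AND status = 'open' OR status = 'pending' "
        "ORDER BY id"
    )
    return [row[0] for row in conn.execute(sql, (owner_id,))]


def active_tickets_fixed(conn, owner_id):
    # The owner filter now applies to both statuses.
    sql = (
        "SELECT id FROM tickets "
        "WHERE owner_id = ? AND status IN ('open', 'pending') "
        "ORDER BY id"
    )
    return [row[0] for row in conn.execute(sql, (owner_id,))]


# Happy path: only one owner exists, so both versions agree.
single_owner = make_db([(1, "open"), (1, "pending"), (1, "closed")])

# Realistic data: a second owner has a pending ticket.
two_owners = make_db([(1, "open"), (1, "closed"), (2, "pending")])
```

On the single-owner database both functions return the same ids, which is exactly the kind of test an agent's output can pass. On the two-owner database the buggy version also returns the ticket that belongs to owner 2.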
The benchmark should get meaner
The next useful coding-agent benchmark should not ask only whether the output works.
It should ask whether the output belongs.
Did the agent keep the existing architecture? Did it use the same validation pattern? Did it avoid inventing a second way to talk to the database? Did it preserve migrations? Did it handle permissions and failure cases? Did it add tests in the right layer? Did it make the future maintainer’s life easier or just create a working object that now needs adult supervision?
That is the real frontier for coding agents.
The question is not “can it write code?”
It is “can it obey accumulated constraints under pressure?”
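One way to picture a meaner benchmark is to report two numbers instead of one: how often the output works, and how often it also belongs. The sketch below assumes each task yields behavioral assertion counts plus a count of structural violations from static verifiers. The TaskResult fields and the scoring rule are hypothetical, not the paper's actual metric.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Outcome of one benchmark task (all field names are hypothetical)."""
    name: str
    assertions_passed: int
    assertions_total: int
    structural_violations: int  # count reported by static verifiers


def works(result: TaskResult) -> bool:
    """True when every behavioral assertion passed."""
    return (
        result.assertions_total > 0
        and result.assertions_passed == result.assertions_total
    )


def belongs(result: TaskResult) -> bool:
    """True when the output works and also breaks no structural rule."""
    return works(result) and result.structural_violations == 0


def summarize(results: list[TaskResult]) -> dict[str, float]:
    """Fraction of tasks that work, and fraction that work and belong."""
    total = len(results)
    if total == 0:
        return {"works_rate": 0.0, "belongs_rate": 0.0}
    return {
        "works_rate": sum(works(r) for r in results) / total,
        "belongs_rate": sum(belongs(r) for r in results) / total,
    }
```

The gap between the two rates is the interesting part: it counts outputs that pass the tests but would still fail a careful code review.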
What builders should do now
The practical response is not to stop using agents. That would be silly. The productivity gains are real.
The response is to make constraints executable.
Turn architecture rules into tests where possible. Add static checks for the patterns you care about. Keep database expectations explicit. Give the agent narrower tasks. Make the review path focus on integration, not just syntax. Prefer smaller, sharper instructions over broad “build this feature” prompts when the repo has real history.
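As a small illustration of an executable constraint, here is a sketch of a static check built on Python's standard ast module. It enforces one made-up architecture rule: only files under a repository layer may import the ORM. The app/repositories/ path, the choice of sqlalchemy as the forbidden package, and the function name are assumptions for the example. A real project would substitute its own layout and rules.

```python
import ast

# Hypothetical rule: only the repository layer may import the ORM directly.
FORBIDDEN_PACKAGES = {"sqlalchemy"}
ALLOWED_PREFIX = "app/repositories/"


def orm_import_violations(path: str, source: str) -> list[str]:
    """Return one message per forbidden import found in `source`."""
    if path.startswith(ALLOWED_PREFIX):
        return []
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # node.module is None for relative imports like "from . import x".
            names = [node.module or ""]
        else:
            continue
        for name in names:
            # Compare the top-level package, so "sqlalchemy.orm" also matches.
            if name.split(".")[0] in FORBIDDEN_PACKAGES:
                violations.append(f"{path}:{node.lineno}: imports {name}")
    return violations
```

Run over every changed file in CI, a check like this turns "do not bypass the existing abstraction" from a sentence in a prompt into a failure the agent and the reviewer both see.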
Most importantly, stop treating the agent’s first answer as a product and start treating it as a proposal.
The coding-agent story is still moving fast, but this paper points to the line between demo value and production value. The demo is output. The product is constraint discipline.
That is a much harder problem.
And a much more useful one.
Sources: arXiv, Hacker News front page, May 24, 2026