When AI Agents Say SATISFIED But the Code Has Bugs
I built an automated software development pipeline. Five agents — planner, implementer, reviewer, tester, user — passing work downstream through a feedback loop. The user agent runs the code, evaluates it, and declares either SATISFIED or NEEDS_IMPROVEMENT. If satisfied, the loop terminates. If not, it goes back to the planner.
On one run, the tester created 97 tests. 96 passed. One failed: is_prime(4.9) returns True. The user agent confirmed the bug and found additional issues: the interactive demo crashes with EOFError, and the import instructions are missing.
The user agent’s verdict: SATISFIED.
The loop terminated. The bugs shipped.
The Verdict Problem
Here’s how the verdict system worked:
```python
"satisfied": "SATISFIED" in response and "NEEDS_IMPROVEMENT" not in response
```
String matching. If the response contains the word SATISFIED anywhere and doesn’t contain NEEDS_IMPROVEMENT, the loop exits. The user agent wrote a nuanced response acknowledging the bugs but expressing overall satisfaction — and the string matcher saw SATISFIED and stopped.
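That fragility is easy to reproduce. Here is a minimal sketch of the check above, run against an illustrative nuanced response (the response text is invented for the example, not the agent's actual output):

```python
def naive_verdict(response: str) -> bool:
    # The original check: substring matching, nothing more.
    return "SATISFIED" in response and "NEEDS_IMPROVEMENT" not in response

# A nuanced response that acknowledges bugs but still contains the magic word.
response = (
    "The is_prime(4.9) bug remains and the demo crashes with EOFError, "
    "but the core functionality works, so overall I am SATISFIED."
)

print(naive_verdict(response))  # True -- the loop exits, the bugs ship
```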
This is the same class of problem as my research agents holding contradictory beliefs. But instead of playing out over weeks as beliefs drift apart, it plays out in minutes as claims cascade through a pipeline.
Spatial vs Temporal Belief Failure
In my research programme, belief failures are temporal: a claim made in January becomes stale by February but nobody notices. The drift is slow. There are human checkpoints. Eventually someone catches it.
In an automated SDLC pipeline, belief failures are spatial: a hallucination in the planner propagates through implementer, reviewer, tester, and user in a single run. There are no human checkpoints. The cascade completes in minutes.
| Dimension | Research Programme | SDLC Pipeline |
|---|---|---|
| Propagation | Temporal (weeks/months) | Spatial (minutes) |
| Detection | Human eventually catches it | No human in loop |
| Time to failure | Weeks | Minutes |
| Failure mode | Stale claims | Cascading hallucinations |
The spatial version is more dangerous because it’s faster and unattended. But it’s the same underlying problem: agents trusting upstream claims without verification.
The Cascade
I documented this pattern separately as cascading hallucination:
| Agent | Claimed | Reality |
|---|---|---|
| Implementer | “I edited the file with new functions” | Wrote documentation only — prompt didn’t authorize editing existing files |
| Reviewer | “Reviewed the code at lines 942, 976” | Reviewed lines that don’t exist |
| Tester | “14 tests pass!” | Tests for functions that don’t exist |
Each agent trusted the previous agent’s claims. Each generated plausible artifacts assuming prior work was done. The implementer confabulated because its prompt said “write output files to this directory” — it couldn’t edit existing code, so it pretended it had. The reviewer reviewed the nonexistent edits. The tester tested the nonexistent functions.
This is the multi-agent version of “I assumed someone else checked it.”
The Fix: Structured Verdicts
Replace string matching with structured verdict blocks:
```
## Verdict
STATUS: NEEDS_IMPROVEMENT
OPEN_ISSUES:
- is_prime(4.9) returns True (float handling bug)
- Interactive demo crashes with EOFError
- Missing import instructions in README
```
Parse the STATUS: field. Parse the OPEN_ISSUES: list. Now add an exit gate:
If STATUS is SATISFIED but OPEN_ISSUES is non-empty, reject the verdict.
This one rule would have caught the is_prime(4.9) bug. The user agent acknowledged the issues in its response but declared satisfaction anyway. A structured verdict makes that contradiction explicit and actionable.
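A sketch of what the parser plus exit gate could look like, using the field names from the verdict block above (the helper names and parsing details are my assumptions, not the pipeline's actual code):

```python
def parse_verdict(text: str) -> tuple[str, list[str]]:
    """Extract STATUS and the OPEN_ISSUES list from a structured verdict block."""
    status = "UNKNOWN"
    issues: list[str] = []
    in_issues = False
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("STATUS:"):
            status = line.removeprefix("STATUS:").strip()
            in_issues = False
        elif line.startswith("OPEN_ISSUES:"):
            in_issues = True
        elif in_issues and line.startswith("- "):
            issues.append(line[2:])
    return status, issues

def may_exit(text: str) -> bool:
    """Exit gate: a SATISFIED verdict with non-empty OPEN_ISSUES is rejected."""
    status, issues = parse_verdict(text)
    return status == "SATISFIED" and not issues

verdict = """\
## Verdict
STATUS: SATISFIED
OPEN_ISSUES:
- is_prime(4.9) returns True (float handling bug)
"""
print(may_exit(verdict))  # False -- the contradiction is caught, the loop continues
```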
More Fixes
Adversarial reviewer framing. Change the reviewer’s prompt from neutral to adversarial: “Your primary job is to FIND ERRORS, not to encourage. Do not approve code with known issues.” In my research programme, this single prompt change had the most impact on belief quality. The same principle applies to pipeline agents.
Unresolved issue tracking. When the retry limit is exhausted (typically 3 attempts), collect all unresolved issues and inject them into downstream prompts as explicit warnings. Don’t silently drop known problems.
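One way to sketch that injection (the function name and the warning wording are illustrative, not the pipeline's actual prompt template):

```python
def inject_unresolved(prompt: str, unresolved: list[str], retry_limit: int = 3) -> str:
    """After the retry limit is exhausted, carry known problems downstream
    as explicit warnings instead of silently dropping them."""
    if not unresolved:
        return prompt
    warnings = "\n".join(f"- UNRESOLVED: {issue}" for issue in unresolved)
    return (
        f"{prompt}\n\n"
        f"KNOWN UNRESOLVED ISSUES (retry limit of {retry_limit} reached):\n"
        f"{warnings}\n"
        "Do not claim these are fixed unless you actually fix them."
    )
```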
Iteration-ordered entries. Instead of flat output directories (workspace/implementer/), create entries/iteration-1/implementer.md, entries/iteration-2/reviewer.md. This creates an immutable audit trail with temporal ordering — the same principle from LLMs Have No Memory of Time, but with iteration numbers instead of calendar dates because the pipeline runs in minutes, not days.
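A minimal sketch of the path convention, with an overwrite guard to keep the trail immutable (the guard is my addition, not necessarily how the pipeline enforces immutability):

```python
from pathlib import Path

def entry_path(root: Path, iteration: int, agent: str) -> Path:
    """Build an iteration-ordered entry path like entries/iteration-1/implementer.md.
    Refuses to return a path that already exists, so entries are append-only."""
    path = root / "entries" / f"iteration-{iteration}" / f"{agent}.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    if path.exists():
        raise FileExistsError(f"audit entry already written: {path}")
    return path

# usage: entry_path(Path("workspace"), 2, "reviewer")
# -> workspace/entries/iteration-2/reviewer.md
```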
Beliefs integration. Register claims per pipeline stage. Run beliefs check-refs between stages to verify claims against actual code. Add a second exit gate: if beliefs has active WARNINGs and the verdict is SATISFIED, escalate instead of terminating.
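A hedged sketch of that second exit gate. beliefs is the CLI from this series; the check-refs invocation and the WARNING substring check below are assumptions about its output format, and the gate degrades gracefully when the CLI isn't installed:

```python
import shutil
import subprocess

def beliefs_has_warnings() -> bool:
    """True if the beliefs CLI reports active WARNINGs. Returns False when the
    CLI is not installed (graceful degradation). The check-refs subcommand and
    the WARNING output text are assumptions, not the tool's documented interface."""
    if shutil.which("beliefs") is None:
        return False
    result = subprocess.run(["beliefs", "check-refs"], capture_output=True, text=True)
    return "WARNING" in result.stdout

def second_exit_gate(verdict_status: str, has_warnings: bool) -> str:
    # SATISFIED plus active warnings means escalate, never terminate.
    if verdict_status == "SATISFIED" and has_warnings:
        return "ESCALATE"
    return "TERMINATE" if verdict_status == "SATISFIED" else "RETRY"
```

Keeping the gate logic in a pure function (second_exit_gate) separate from the CLI call makes it testable without the tool installed.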
One Entry, 67 Minutes, All Six Fixes
I wrote these six suggestions as a to-do list in a dated entry and committed it. A separate Claude session — with zero shared context — read the entry 67 minutes later and implemented all six. 371 lines of code in a single commit, plus a follow-up bugfix.
The implementing session went beyond the spec: it added backwards-compatible verdict parsing (legacy fallback for old format), capped planner claims at 5 (prevents noisy registries), and added graceful degradation if the beliefs CLI isn’t installed.
Entries as specs. The filesystem as the coordination mechanism. No handoff meeting required.
The Takeaway
If you run automated agent pipelines, your verdict system is probably string matching. It probably works most of the time. And when it fails, it fails silently — bugs declared fixed, issues declared resolved, quality declared satisfactory.
Add structured verdicts. Add exit gates. Make your reviewers adversarial. Track unresolved issues explicitly. The fixes are straightforward. The cost of not having them is shipping code that your own agents identified as buggy and then approved anyway.
This is post 4 in a series on belief management for AI agents. Previously: 5 Agents Adopted My Tool Without Being Told To. Next: how a single entry coordinated a 371-line implementation across two sessions with zero shared context.