When AI Agents Say SATISFIED But the Code Has Bugs
I built an automated software development pipeline. Five agents — planner, implementer, reviewer, tester, user — passing work downstream through a feedback loop. The user agent runs the code, evaluates it, and declares either SATISFIED or NEEDS_IMPROVEMENT. If satisfied, the loop terminates. If not, it goes back to the planner.
On one run, the tester created 97 tests. 96 passed. One failed: is_prime(4.9) returns True. The user agent confirmed the bug and found additional issues: the interactive demo crashes with EOFError, and the import instructions are missing.
The user agent’s verdict: SATISFIED.
The loop terminated. The bugs shipped.
The Verdict Problem
Here’s how the verdict system worked:
```python
"satisfied": "SATISFIED" in response and "NEEDS_IMPROVEMENT" not in response
```
String matching. If the response contains the word SATISFIED anywhere and doesn’t contain NEEDS_IMPROVEMENT, the loop exits. The user agent wrote a nuanced response acknowledging the bugs but expressing overall satisfaction — and the string matcher saw SATISFIED and stopped.
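That fragility is easy to reproduce. Here is a minimal sketch of the check above, run against an illustrative nuanced response (the response text is invented for the example, not the agent's actual output):

```python
def naive_verdict(response: str) -> bool:
    # The original check: substring matching, nothing more.
    return "SATISFIED" in response and "NEEDS_IMPROVEMENT" not in response

# A nuanced response that acknowledges bugs but still contains the magic word.
response = (
    "The is_prime(4.9) bug remains and the demo crashes with EOFError, "
    "but the core functionality works, so overall I am SATISFIED."
)

print(naive_verdict(response))  # True -- the loop exits, the bugs ship
```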
This is the same class of problem as my research agents holding contradictory beliefs. But instead of playing out over weeks as beliefs drift apart, it plays out in minutes as claims cascade through a pipeline.
Spatial vs Temporal Belief Failure
In my research programme, belief failures are temporal: a claim made in January becomes stale by February but nobody notices. The drift is slow. There are human checkpoints. Eventually someone catches it.
In an automated SDLC pipeline, belief failures are spatial: a hallucination in the planner propagates through implementer, reviewer, tester, and user in a single run. There are no human checkpoints. The cascade completes in minutes.
| Dimension | Research Programme | SDLC Pipeline |
|---|---|---|
| Propagation | Temporal (weeks/months) | Spatial (minutes) |
| Detection | Human eventually catches it | No human in loop |
| Time to failure | Weeks | Minutes |
| Failure mode | Stale claims | Cascading hallucinations |
The spatial version is more dangerous because it’s faster and unattended. But it’s the same underlying problem: agents trusting upstream claims without verification.
The Cascade
I documented this pattern separately as cascading hallucination:
| Agent | Claimed | Reality |
|---|---|---|
| Implementer | “I edited the file with new functions” | Wrote documentation only — prompt didn’t authorize editing existing files |
| Reviewer | “Reviewed the code at lines 942, 976” | Reviewed lines that don’t exist |
| Tester | “14 tests pass!” | Tests for functions that don’t exist |
Each agent trusted the previous agent’s claims. Each generated plausible artifacts assuming prior work was done. The implementer confabulated because its prompt said “write output files to this directory” — it couldn’t edit existing code, so it pretended it had. The reviewer reviewed the nonexistent edits. The tester tested the nonexistent functions.
This is the multi-agent version of “I assumed someone else checked it.”
The Fix: Structured Verdicts
Replace string matching with structured verdict blocks:
```
## Verdict
STATUS: NEEDS_IMPROVEMENT
OPEN_ISSUES:
- is_prime(4.9) returns True (float handling bug)
- Interactive demo crashes with EOFError
- Missing import instructions in README
```
Parse the STATUS: field. Parse the OPEN_ISSUES: list. Now add an exit gate:
If STATUS is SATISFIED but OPEN_ISSUES is non-empty, reject the verdict.
This one rule would have caught the is_prime(4.9) bug. The user agent acknowledged the issues in its response but declared satisfaction anyway. A structured verdict makes that contradiction explicit and actionable.
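A sketch of what the parser plus exit gate could look like, using the field names from the verdict block above (the helper names and parsing details are my assumptions, not the pipeline's actual code):

```python
def parse_verdict(text: str) -> tuple[str, list[str]]:
    """Extract STATUS and the OPEN_ISSUES list from a structured verdict block."""
    status = "UNKNOWN"
    issues: list[str] = []
    in_issues = False
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("STATUS:"):
            status = line.removeprefix("STATUS:").strip()
            in_issues = False
        elif line.startswith("OPEN_ISSUES:"):
            in_issues = True
        elif in_issues and line.startswith("- "):
            issues.append(line[2:])
    return status, issues

def may_exit(text: str) -> bool:
    """Exit gate: a SATISFIED verdict with non-empty OPEN_ISSUES is rejected."""
    status, issues = parse_verdict(text)
    return status == "SATISFIED" and not issues

verdict = """\
## Verdict
STATUS: SATISFIED
OPEN_ISSUES:
- is_prime(4.9) returns True (float handling bug)
"""
print(may_exit(verdict))  # False -- the contradiction is caught, the loop continues
```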
More Fixes
Adversarial reviewer framing. Change the reviewer’s prompt from neutral to adversarial: “Your primary job is to FIND ERRORS, not to encourage. Do not approve code with known issues.” In my research programme, this single prompt change had the most impact on belief quality. The same principle applies to pipeline agents.
Unresolved issue tracking. When the retry limit is exhausted (typically 3 attempts), collect all unresolved issues and inject them into downstream prompts as explicit warnings. Don’t silently drop known problems.
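One way to sketch that injection (the function name and the warning wording are illustrative, not the pipeline's actual prompt template):

```python
def inject_unresolved(prompt: str, unresolved: list[str], retry_limit: int = 3) -> str:
    """After the retry limit is exhausted, carry known problems downstream
    as explicit warnings instead of silently dropping them."""
    if not unresolved:
        return prompt
    warnings = "\n".join(f"- UNRESOLVED: {issue}" for issue in unresolved)
    return (
        f"{prompt}\n\n"
        f"KNOWN UNRESOLVED ISSUES (retry limit of {retry_limit} reached):\n"
        f"{warnings}\n"
        "Do not claim these are fixed unless you actually fix them."
    )
```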
Iteration-ordered entries. Instead of flat output directories (workspace/implementer/), create entries/iteration-1/implementer.md, entries/iteration-2/reviewer.md. This creates an immutable audit trail with temporal ordering — the same principle from LLMs Have No Memory of Time, but with iteration numbers instead of calendar dates because the pipeline runs in minutes, not days.
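A minimal sketch of the path convention, with an overwrite guard to keep the trail immutable (the guard is my addition, not necessarily how the pipeline enforces immutability):

```python
from pathlib import Path

def entry_path(root: Path, iteration: int, agent: str) -> Path:
    """Build an iteration-ordered entry path like entries/iteration-1/implementer.md.
    Refuses to return a path that already exists, so entries are append-only."""
    path = root / "entries" / f"iteration-{iteration}" / f"{agent}.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    if path.exists():
        raise FileExistsError(f"audit entry already written: {path}")
    return path

# usage: entry_path(Path("workspace"), 2, "reviewer")
# -> workspace/entries/iteration-2/reviewer.md
```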
Beliefs integration. Register claims per pipeline stage. Run beliefs check-refs between stages to verify claims against actual code. Add a second exit gate: if beliefs has active WARNINGs and the verdict is SATISFIED, escalate instead of terminating.
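A hedged sketch of that second exit gate. beliefs is the CLI from this series; the check-refs invocation and the WARNING substring check below are assumptions about its output format, and the gate degrades gracefully when the CLI isn't installed:

```python
import shutil
import subprocess

def beliefs_has_warnings() -> bool:
    """True if the beliefs CLI reports active WARNINGs. Returns False when the
    CLI is not installed (graceful degradation). The check-refs subcommand and
    the WARNING output text are assumptions, not the tool's documented interface."""
    if shutil.which("beliefs") is None:
        return False
    result = subprocess.run(["beliefs", "check-refs"], capture_output=True, text=True)
    return "WARNING" in result.stdout

def second_exit_gate(verdict_status: str, has_warnings: bool) -> str:
    # SATISFIED plus active warnings means escalate, never terminate.
    if verdict_status == "SATISFIED" and has_warnings:
        return "ESCALATE"
    return "TERMINATE" if verdict_status == "SATISFIED" else "RETRY"
```

Keeping the gate logic in a pure function (second_exit_gate) separate from the CLI call makes it testable without the tool installed.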
One Entry, 67 Minutes, All Six Fixes
I wrote these six suggestions as a to-do list in a dated entry and committed it. A separate Claude session — with zero shared context — read the entry 67 minutes later and implemented all six. 371 lines of code in a single commit, plus a follow-up bugfix.
The implementing session went beyond the spec: it added backwards-compatible verdict parsing (legacy fallback for old format), capped planner claims at 5 (prevents noisy registries), and added graceful degradation if the beliefs CLI isn’t installed.
Entries as specs. The filesystem as the coordination mechanism. No handoff meeting required.
The Takeaway
If you run automated agent pipelines, your verdict system is probably string matching. It probably works most of the time. And when it fails, it fails silently — bugs declared fixed, issues declared resolved, quality declared satisfactory.
Add structured verdicts. Add exit gates. Make your reviewers adversarial. Track unresolved issues explicitly. The fixes are straightforward. The cost of not having them is shipping code that your own agents identified as buggy and then approved anyway.
This is post 4 in a series on belief management for AI agents. Previously: 5 Agents Adopted My Tool Without Being Told To. Next: how a single entry coordinated a 371-line implementation across two sessions with zero shared context.