When AI Agents Say SATISFIED But the Code Has Bugs
Here’s a fun one. I built a fully automated software development pipeline — five AI agents, no human in the loop. Planner, implementer, reviewer, tester, and a user agent that runs the code and delivers a final verdict.
The pipeline built a prime number checker. The tester created 97 tests. 96 passed. One failed: is_prime(4.9) returns True.
The user agent ran the code, confirmed the bug, noted that the interactive demo also crashes with an EOFError, and then declared: SATISFIED.
The loop terminated. Bug shipped.
How a Bug Survives Five Agents
The pipeline looks like this:
Planner → Implementer ↔ Reviewer → Tester ↔ User
Each agent’s output gets passed verbatim into the next agent’s prompt. No validation between stages. No contradiction checking. No structured data — just prose dumped into a context window.
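A minimal sketch of that hand-off pattern (function and agent names are mine, not the actual pipeline code):

```python
def run_pipeline(task, agents, call_llm):
    """Pass each agent's prose output verbatim into the next agent's prompt.

    No validation, no contradiction checking, no structured data --
    whatever the previous stage wrote becomes the next stage's context.
    """
    context = task
    for name, system_prompt in agents:
        context = call_llm(
            f"{system_prompt}\n\n--- Previous stage output ---\n{context}"
        )
    return context
```

The return value of the final agent is all the verdict system ever sees.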
So when the tester finds is_prime(4.9) returns True, it writes a paragraph about it. The user agent reads that paragraph alongside the tester’s overall assessment, runs the code itself, confirms the bug — and then writes a paragraph that says, roughly: “The implementation works well for its intended purpose. There are some edge cases with float inputs, but overall: SATISFIED.”
The verdict system then parses that paragraph:
"satisfied": "SATISFIED" in response and "NEEDS_IMPROVEMENT" not in response
It finds SATISFIED. It doesn’t find NEEDS_IMPROVEMENT. Verdict: ship it.
The Three Failures
This isn’t one failure. It’s three, stacked on top of each other.
Failure 1: The verdict is string matching, not a verdict system. The pipeline scans agent prose for keywords. If an agent writes “while there are NEEDS_CHANGES in some areas, overall APPROVED” — the parser sees both keywords and the logic breaks. There’s no structured data. No separation between the agent’s reasoning and its conclusion. The verdict is whatever keyword happens to appear last in a paragraph of hedging.
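The failure is easy to reproduce with the keyword check shown earlier (the function name is mine; the logic is the pipeline's):

```python
def naive_verdict(response: str) -> bool:
    # The pipeline's actual check: scan the whole response for keywords.
    return "SATISFIED" in response and "NEEDS_IMPROVEMENT" not in response

# A hedging paragraph that documents two bugs and still passes:
response = (
    "is_prime(4.9) returns True and the demo crashes with EOFError, "
    "but the implementation works well for its intended purpose. SATISFIED."
)
assert naive_verdict(response)  # verdict: ship it, bugs and all
```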
Failure 2: Agents identify problems and then ignore them. The tester found the bug. The user confirmed the bug. Both agents documented the bug. Neither escalated it. This is the same pattern I’ve seen across my research programme — AI agents default to diplomatic hedging. They’ll describe a problem in detail and then conclude with “overall, things look good.” The constructive framing of the reviewer prompt (“review the implementation”) produces agents that approve with caveats rather than reject with reasons.
Failure 3: The timeout silently drops unresolved issues. The reviewer can send code back to the implementer up to three times. The tester can do the same. After three attempts, the system moves on — silently. Whatever was wrong after the third attempt simply vanishes from the pipeline. No record. No warning. The downstream agents don’t know that the upstream agent gave up.
The Cascade
This is what I call a cascading belief failure. Each agent adds a layer of apparent validation without checking whether the previous layer’s conclusions are consistent with the evidence.
Tester finds bug → documents it → passes output downstream
User reads output → confirms bug → declares SATISFIED
Verdict parser → finds "SATISFIED" → terminates loop
Every stage does its job. The tester tests. The user uses. The parser parses. But nobody checks: does the conclusion match the evidence?
In a human team, this gets caught in the meeting where someone says “wait, you said it has a bug — why are we shipping?” AI agents don’t have that meeting. They read files and produce files. If the files are internally contradictory, the contradiction propagates.
The Fix: Structure, Not Prompting
You can’t fix this with better prompts. “Be more careful with your verdict” doesn’t work — the agent was careful. It documented the bug thoroughly. The problem is structural: there’s no mechanism to check whether a conclusion contradicts the evidence that supports it.
Here’s what I implemented:
1. Structured verdicts
Replace prose parsing with a required format:
## Verdict
STATUS: NEEDS_CHANGES
OPEN_ISSUES:
- is_prime(4.9) returns True (float inputs not handled)
- Interactive demo crashes with EOFError
Now the pipeline has structured data: a STATUS field and an OPEN_ISSUES list. Parse the block with a regex, not the full response. If the agent doesn’t emit the block, fall back to the old keyword matching — backwards compatible, but the structured path is what you actually check.
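A sketch of that parser, assuming the block format above (the regex and field names are mine; the fallback mirrors the old keyword scan):

```python
import re

# Match the '## Verdict' block: a STATUS line, optionally followed
# by an OPEN_ISSUES list of '- ' bullet lines.
VERDICT_RE = re.compile(
    r"## Verdict\s*\n"
    r"STATUS:\s*(?P<status>\w+)"
    r"(?:\s*\nOPEN_ISSUES:\n(?P<issues>(?:- .*\n?)*))?"
)

def parse_verdict(response: str):
    """Parse the structured verdict block if present; otherwise fall
    back to legacy keyword matching (backwards compatible)."""
    m = VERDICT_RE.search(response)
    if m:
        raw = m.group("issues") or ""
        issues = [ln[2:].strip() for ln in raw.splitlines()
                  if ln.startswith("- ")]
        return m.group("status"), issues
    # Legacy fallback: keyword scan, with no issue list to check against.
    satisfied = "SATISFIED" in response and "NEEDS_IMPROVEMENT" not in response
    return ("SATISFIED" if satisfied else "NEEDS_IMPROVEMENT"), []
```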
2. Exit gates
A one-line rule: if STATUS is SATISFIED but OPEN_ISSUES is non-empty, reject the verdict.
[EXIT GATE] Verdict SATISFIED contradicts 2 open issues. Escalating.
The is_prime case is now caught. The user agent says SATISFIED, lists two bugs, and the exit gate blocks the loop from terminating. Either the user resolves the contradiction or a human gets asked.
Same rule for the reviewer: APPROVED with open issues is a contradiction. Don’t let it through.
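The gate itself is only a few lines, given the structured fields from the parser (a sketch; the escalation side is elided here):

```python
def exit_gate(status: str, open_issues: list) -> bool:
    """Return True only if the verdict may terminate the loop."""
    if status in ("SATISFIED", "APPROVED") and open_issues:
        # A positive verdict alongside open issues is a contradiction:
        # block termination and escalate instead of shipping.
        print(f"[EXIT GATE] Verdict {status} contradicts "
              f"{len(open_issues)} open issues. Escalating.")
        return False
    return status in ("SATISFIED", "APPROVED")
```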
3. Carry unresolved issues forward
When the reviewer-implementer loop exhausts its three attempts, collect the unresolved issues into a list and inject them downstream:
WARNING — UNRESOLVED ISSUES FROM PREVIOUS STAGES:
- is_prime(4.9) returns True (float inputs not handled)
- Interactive demo crashes with EOFError
Now the tester and user see the problems that previous stages couldn’t fix. Issues travel with the pipeline instead of being silently dropped.
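One way to sketch the carry-forward (function name is mine; the warning format mirrors the one above):

```python
def carry_forward(unresolved: list, next_prompt: str) -> str:
    """Inject issues left over from an exhausted retry loop into the
    next stage's prompt, instead of silently dropping them."""
    if not unresolved:
        return next_prompt
    warning = "WARNING — UNRESOLVED ISSUES FROM PREVIOUS STAGES:\n" + \
        "\n".join(f"- {issue}" for issue in unresolved)
    return f"{warning}\n\n{next_prompt}"
```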
4. Adversarial framing
Change the reviewer’s prompt from “review the implementation” to:
Your primary job is to FIND ERRORS, not to encourage. If the code has
problems, say NEEDS_CHANGES. Do not approve code with known issues just
because the overall structure is acceptable. A single serious bug is
grounds for NEEDS_CHANGES.
This is the single prompt change that had the most impact in a separate multi-agent research programme I’ve been running. The same model, given an adversarial framing instead of a constructive one, goes from diplomatic hedging to genuinely useful error-finding. The difference isn’t capability — it’s permission.
The Pattern
Every automated pipeline that passes AI-generated conclusions downstream has this vulnerability. It’s not specific to software development. Anywhere you have:
- An AI agent that produces a conclusion
- Another AI agent that consumes that conclusion
- No structural check between them
…you get cascading belief failure. The conclusion travels faster than the evidence. Each downstream agent treats the upstream conclusion as validated — because it came from the previous stage, and the previous stage must have checked, right?
Nobody checked. Everyone assumed someone else did.
The Implementation
These fixes were specified in a to-do list, committed to git, and implemented by a separate AI session in 67 minutes: 371 lines of code. The implementing session also found and fixed a verdict-parsing edge case that the spec didn’t anticipate. More on that coordination story in the next post.
The code is at github.com/benthomasson/multiagent-loop.
What I Learned
- String matching is not a verdict system. If your pipeline parses AI output by scanning for keywords, your pipeline doesn’t have verdicts — it has vibes. Require structured output blocks and parse those.
- Agents will describe problems and then declare success. This isn’t a bug in the model. It’s the default behavior of a system trained to be helpful and constructive. You need structural checks (exit gates) to catch the contradiction between what the agent found and what it concluded.
- Timeouts should produce warnings, not silence. When a feedback loop exhausts its retries, the unresolved issues should travel downstream as explicit warnings — not vanish.
- Adversarial framing is a one-line fix with outsized impact. “Find errors, not encourage” changes agent behavior more than any amount of “be thorough” or “be careful.”
- Structure beats prompting. You can’t prompt your way out of a structural problem. Structured verdicts, exit gates, and issue propagation are infrastructure changes — they work regardless of how well the agent follows instructions.
This is part of a series on belief management for AI agents. Previously: LLMs Have No Memory of Time. Next up: how a to-do list in a markdown file became a spec that another AI session implemented in 67 minutes — with no shared context.