7 minute read

Every stage in a multi-agent pipeline can produce an incorrect answer. Empirically, LLM synthesis steps produce errors 14–40% of the time. Call it 15% for round numbers: each stage has an 85% chance of producing a correct answer.

Without filtering, errors compound multiplicatively:

Stages P(correct)
1 85.0%
2 72.3%
3 61.4%
4 52.2%
5 44.4%

By stage 5, a majority of pipeline runs contain at least one error. This is the dirty pipeline: holes in the pipe let dirt in at every stage, and the longer the pipe, the dirtier the water.

And this is the optimistic version — it assumes errors don’t compound. In practice, an incorrect intermediate result often makes downstream stages more likely to err, not just independently wrong.

The Filter

One solution: put a filter at each stage. The filter has an 85% chance of detecting dirty water and dumping the batch before it progresses. This changes the problem from “do we get correct answers?” to “do we get an answer at all?”

At each stage:

Outcome Probability Result
Correct + filter passes 85% × 85% = 72.25% Good output
Incorrect + filter catches 15% × 85% = 12.75% Correctly dumped
Incorrect + filter misses 15% × 15% = 2.25% Bad output (leak)
Correct + filter dumps 85% × 15% = 12.75% Wasted good work

What passes through the filter is now 72.25 / (72.25 + 2.25) = 97% correct. But only 74.5% of batches survive each stage.

The tradeoff flips from accuracy to throughput. You get much cleaner output, but some batches get dumped and need to be retried.

The Ram Pump

This is analogous to a hydraulic ram pump. A ram pump wastes 80–90% of the water flowing through it to push a small amount to a much higher elevation. The wasted water is the energy source.

Same structure here: the wasted tokens on correct-but-dumped answers are the energy cost of pumping up the probability. You can’t get 97% output quality without paying the 25.5% throughput tax per stage. The waste isn’t a bug — it’s the mechanism.

This reframes the cost question for multi-agent systems. When you see a system that “throws away good work” to verify outputs, it’s not being inefficient. It’s a ram pump. Without that waste, you get the dirty pipeline — 44% correct after 5 stages. With the ram pump filter, you get ~86% correct after 5 stages. You just need to run more batches.

The nice thing about tokens vs water: you can always pump more tokens. Retry is cheap. Contaminated output is expensive.

Recycling the Dumped Water

The ram pump analogy breaks down in one important way: dumped water is lost, but dumped answers are not. We have the rejected output and the reason it was rejected. This opens a recovery path.

The filter doesn’t just say “bad” — it can categorize the failure. From empirical data across six knowledge domains, we see the same four failure modes repeatedly:

  1. Smuggled premises — the synthesis assumes something not in the inputs
  2. Unsupported superlatives — “the most important,” “always,” “never” without evidence
  3. False causal links — correlation presented as causation, or mechanism invented
  4. Domain conflation — concepts from one domain incorrectly applied to another

Each failure mode has a different recovery strategy. Smuggled premises can be fixed by making the missing premise explicit. Unsupported superlatives can be weakened to hedged claims. False causal links can be downgraded to correlations. Domain conflation can be caught by checking whether the inputs actually span the domains being connected.

So instead of dumping batches permanently, you can classify the failure, attempt targeted repair, and re-filter. This costs more tokens but recovers throughput.

The Iterative Pump

A chatbot gets one shot. It runs the pipeline, returns the answer, and moves on. The dirty pipeline math is the best it can do.

A persistent knowledge base is different. You can reprocess the same domain again and again, each round running generate-then-review over the existing network:

  1. Generate new candidate knowledge from existing knowledge (~85% correct)
  2. Review filters them (~85% detection rate)
  3. Surviving knowledge joins the network
  4. Rejected items get classified and some get repaired
  5. Next round generates from the improved network

This is a fixed-point iteration. Each round pumps up the proportion of correct knowledge in the network, at the cost of more tokens. I wrote a simulation to find the fixed points:

Scenario Max Quality Rounds
No filter (dirty pipeline) 85.4% 39
Filter only (dump and discard) 97.2% 20
Filter + recovery 97.6% 17
Filter + review existing items 98.0% 6
Full system (filter + recovery + review existing) 98.2% 5

The filter alone is the biggest lever: +12pp. Reviewing existing items converges 3x faster than recovery because it cleans the substrate that future rounds draw from, not just the current batch.

Sensitivity: What If the Accuracy Is Different?

The 85% rate is an empirical midpoint. But what if the pipeline stages are worse or better? Using the same percentage for both generation accuracy and filter detection:

Configuration p=60% p=85% p=90% p=95% p=99%
No filter 60.6% 85.4% 90.3% 95.2% 99.1%
Filter only 70.0% 97.2% 98.9% 99.8% ~100%
Full system 95.8% 98.2% 99.3% 99.8% ~100%

At 60% accuracy — worse than a coin flip on hard tasks — the full system still pumps quality to 95.8%. At 99%, the architecture is unnecessary. The biggest advantage is at low accuracy rates, which matches empirical findings that this architecture helps weaker models more than stronger ones.

How many pipeline stages before you drop below 50%?

Accuracy Stages to <50%
60% 2
85% 5
90% 7
95% 14
99% 69

At 60%, two stages and you’re below a coin flip. At 85%, five stages. Any multi-step agent pipeline needs to reckon with this math.

When Is This Architecture Unnecessary?

If LLMs reach 99% correct on synthesis tasks, the dirty pipeline doesn’t bite until stage 69, and the filter adds almost nothing. This entire architecture becomes unnecessary.

But consider what 99% means: out of every 100 synthesis steps, at most 1 produces an error. Today’s frontier models are at 60–85% on domain synthesis tasks. Getting from 85% to 99% requires a ~10x reduction in error rate.

For comparison, software defect rates dropped from ~25 per KLOC in the 1970s to ~1–5 per KLOC with modern practices — roughly 10x over 40 years. These are hard-won gains.

The filter architecture works today at current error rates. And here’s the key insight: this architecture may be how you get to 98% in practice. The simulation shows an 85% model with the full system converging to 98.2%. You get near-99% system performance without building a 99% model — you build a system around an 85% model that iteratively corrects itself. The engineering effort shifts from “make the model better” to “make the pipeline smarter.”

Always Filter After Synthesis

What about a system that queries a pre-built 98% knowledge base to answer user questions? The synthesis step — turning stored knowledge into a user-facing answer — is an 85% pipeline stage. Without a filter:

98% × 85% = 83.3% chance the answer is correct.

All the work to build a 98% knowledge base, and one unfiltered synthesis step drops you to 83%.

Add a filter:

What gets through: 96.6% correct. The cost: ~1 in 4 queries needs a retry.

The principle: always filter after synthesis. Any time you do a synthesis step, an unfiltered output throws away the quality you’ve accumulated. The filter costs tokens and latency, but without it you’re serving dirty water from clean pipes.

There Is No Minimum

You might expect that below 50% accuracy — worse than a coin flip — the pump would fail to converge. It doesn’t:

Accuracy Fixed Point
30% 78.9%
40% 90.0%
50% 95.3%
85% 98.2%

Even at 30% accuracy, the system converges to 78.9%. It just takes a lot more rounds and tokens to get there.

The pump always works because the filter is asymmetric: it catches errors and removes them from the network, but correct items that pass through stay permanently. As long as generation produces any correct output and the filter catches any errors, the ratio tilts toward correct over enough rounds.

The real question is economics, not convergence. At 30%, you spend 40x more tokens than at 85%. At some point, you’re better off using a better model than pumping a terrible one. But the mathematical guarantee is there: the pump works at any accuracy above zero.

Implications for Multi-Agent Systems

If you’re building multi-agent pipelines — chains of LLM calls, tool use sequences, agentic workflows — the dirty pipeline math applies directly:

  1. Every unfiltered stage leaks errors. Five stages at 85% = 44% correct. This isn’t a quality concern; it’s a majority-failure-rate concern.

  2. Filters are more valuable than better models. Adding an 85% filter to an 85% pipeline stage gives you 97% — a bigger improvement than upgrading the model from 85% to 95% would give you without a filter.

  3. Always filter after synthesis. Whether you’re building a knowledge base or answering a user question, an unfiltered synthesis step throws away upstream quality.

  4. Persist filtered knowledge. A chatbot gets one shot through the pipeline. A persistent knowledge base can iterate, pumping accuracy up to 98%+ over multiple rounds. Build the knowledge base once; query it many times.

  5. The waste is the mechanism. When you see a system that “wastes” tokens verifying and occasionally discarding good work, it’s not being inefficient. It’s a ram pump. The waste is what pushes quality upward.