The Cognitive Budget Principle: Why Architecture Beats Model Size
The cheapest Claude model, Haiku, at one-sixtieth of Opus’s price, lands within three points of Opus (95% vs. 98%) across 3,853 expert-domain questions. A 12-billion-parameter open-weight model running on consumer hardware outscores frontier cloud models saddled with the single-task design. The difference isn’t the model. It’s the architecture.
We discovered this by accident, in a single day of experiments that started with a modest question and ended with a general principle about how LLMs process information.
The Starting Point: A 5-Point Gap That Wasn’t
We’d built a knowledge system for AI agents — ftl-reasons, a structured belief network where claims have provenance, dependency chains, and contradiction records. Think of it as a curated knowledge base with relationships between facts, not just a bag of text chunks.
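For illustration, a single belief in such a network might look like the sketch below. The field names are our invention for this post, not ftl-reasons’ actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Belief:
    """One claim in the network, carrying the relationships that make it
    more than a text chunk. Field names are illustrative only, not
    ftl-reasons' actual schema."""
    claim: str                 # the assertion itself
    provenance: list[str]      # sources the claim traces back to
    derived_from: list[str] = field(default_factory=list)  # beliefs this one depends on
    contradicts: list[str] = field(default_factory=list)   # beliefs it conflicts with

# A derived belief: answering questions about it means following the
# dependency chain, not just matching the claim's text.
belief = Belief(
    claim="Deployment step 7 is unsafe under concurrent writes",
    provenance=["ops-handbook.pdf"],
    derived_from=["belief-017", "belief-203"],
)
```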
We compared it against standard RAG (retrieval-augmented generation — the usual approach of chunking documents and searching by vector similarity). On 50 expert-domain questions, our belief system scored 95.5% and RAG scored 90.9%. A modest 5-point gap.
That gap was misleading.
When we designed questions that required multi-hop reasoning, following chains of derived conclusions across multiple domains, the belief system earned a perfect grade on all 25 questions while RAG managed 10 perfect grades and 2 outright failures. The general questions were too easy. They were information-retrieval questions, RAG’s home turf. The hard questions exposed a structural advantage that the easy questions hid.
But the more interesting discovery was about the models themselves.
The Sonnet Anomaly
When we tested our belief system with Sonnet (Claude’s mid-tier model) instead of Opus, something strange happened. Sonnet scored 42%. Not 90%. Not 80%. Forty-two percent.
RAG with the same model scored 84%. So Sonnet could clearly synthesize well — it just couldn’t do it through our belief system’s protocol.
We tried everything. Simplified the protocol to a single pass instead of iterative search. Added source documents alongside beliefs. Stripped the structured metadata. Nothing worked. Sonnet was stuck at 34–42% across every variant we tested.
Meanwhile, Haiku — the smallest Claude model — responded to every one of those interventions. Adding source documents helped. Stripping metadata helped. Haiku moved from 40% to 76% across the ablation.
The interventions that unlocked Haiku did nothing for Sonnet. Why?
The Instruction-Following Paradox
Our belief system’s prompt included a strict refusal instruction: “If the beliefs are insufficient to answer, respond EXACTLY with: ‘I don’t have enough beliefs in the network to answer this question.’ Do NOT attempt a partial or speculative answer.”
Here’s what was happening:
- Haiku couldn’t fully process the strict instruction. It didn’t have enough cognitive capacity to parse, internalize, and comply with it. So it ignored it and attempted answers anyway. This accidentally produced better results.
- Sonnet could process and comply with the instruction. It followed it so literally that it over-refused — judging context as insufficient and refusing to answer, even when the same information produced a good answer under a more permissive prompt.
- Opus could process, comply, and exercise judgment about when to comply. It had enough headroom to follow the instruction appropriately without sacrificing synthesis quality.
The strict instruction was a fixed cognitive cost. Haiku couldn’t afford it and skipped it. Sonnet could afford it but had nothing left for the actual work. Opus could afford it and still had headroom.
This is the cognitive budget principle.
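To make the fixed cost concrete, here are the two ends of the strictness spectrum. The strict suffix is quoted from our prompt; the permissive one is an illustrative paraphrase, not our exact wording:

```python
# Strict variant (quoted from our belief-system prompt): a fixed
# compliance cost every model must pay before doing any synthesis.
STRICT_SUFFIX = (
    "If the beliefs are insufficient to answer, respond EXACTLY with: "
    "'I don't have enough beliefs in the network to answer this question.' "
    "Do NOT attempt a partial or speculative answer."
)

# Permissive variant (illustrative wording, not our exact prompt):
# the same task with near-zero compliance overhead.
PERMISSIVE_SUFFIX = (
    "Answer as well as you can from the beliefs provided, noting any "
    "gaps or uncertainty."
)
```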
The Principle
Every LLM has a cognitive budget — the amount of useful reasoning it can perform within a single prompt-response cycle. When a task exceeds that budget, the model fails. The failure mode depends on how far over budget the task is:
- Slightly over: quality degrades gracefully
- Moderately over: the model over-complies with instructions at the expense of the actual task
- Far over: the model can’t even process the full prompt and skips parts of it
The solution isn’t a bigger model. It’s a smaller task.
Dual-Path: Decompose the Problem
Instead of asking one model to do everything in a single prompt — search a belief network, search document chunks, reconcile two knowledge representations, comply with formatting instructions, and synthesize an answer — we split it into three focused calls:
```
             +--- Belief search → answer_1 ----+
Question ----+                                 +--- Merge → final answer
             +--- Document search → answer_2 --+
```
- Belief path: search structured beliefs, synthesize (hard, but focused)
- Document path: search raw document chunks, synthesize (medium, focused)
- Merge: combine two coherent answers into one (trivially easy)
No single step requires the full cognitive budget that the combined task demands. The merge step is so simple that any model handles it equally well.
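In code, the pipeline is three small calls instead of one big one. A minimal sketch, assuming a generic `llm(prompt)` completion helper and two retrieval functions standing in for whatever backends you use:

```python
def dual_path_answer(question: str, llm, search_beliefs, search_chunks) -> str:
    """Three focused calls instead of one overloaded one.
    `llm(prompt)` is any completion function; `search_beliefs` and
    `search_chunks` are stand-ins for the two retrieval backends."""

    # Path 1: structured beliefs only. Hard, but focused.
    beliefs = search_beliefs(question)
    answer_1 = llm(f"Using only these beliefs:\n{beliefs}\n\nAnswer: {question}")

    # Path 2: raw document chunks only. Medium difficulty, focused.
    chunks = search_chunks(question)
    answer_2 = llm(f"Using only these excerpts:\n{chunks}\n\nAnswer: {question}")

    # Merge: operates on two finished answers, not raw data, so no step
    # needs the full budget the combined task would demand.
    return llm(
        "Two independent answers to the same question follow. "
        "Combine them into one answer, preferring points they agree on.\n\n"
        f"Question: {question}\n\nAnswer A:\n{answer_1}\n\nAnswer B:\n{answer_2}"
    )
```

The two paths are independent, so they can also run concurrently; only the merge waits on both.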
The Results
| Model | Type | Single-Task | Dual-Path | Lift |
|---|---|---|---|---|
| Opus | API, frontier | 95.5% | 100% | +4.5pp |
| Sonnet | API, mid-tier | 42% | 98% | +56pp |
| Haiku | API, small | 40% | 100% | +60pp |
| gemma4:31b | local, 31B | 68% | 94% | +26pp |
| Gemini | API, frontier | 62% | 92% | +30pp |
| gemma3:12b | local, 12B | — | 82% | — |
The lift scales inversely with model capability: the weakest models benefit most from decomposition, because their tasks exceeded their budgets by the largest margin.
On a 50-question sample, Haiku matched Opus at 100%. At full scale — 3,853 questions — the results held:
| Model | Full Scale (% graded A or B) | Relative Cost |
|---|---|---|
| Opus | 98% | 60x |
| Sonnet | 97% | 5x |
| Haiku | 95% | 1x |
The architecture compresses a capability gap of more than 50 points (95.5% vs. 40% single-task) down to 3 points at full scale. And those last 3 points cost 60x more to close.
Why This Generalizes
The cognitive budget principle isn’t specific to our belief system or RAG. It appears at every level of LLM system design:
Chain-of-thought prompting works because it breaks one hard reasoning step into a sequence of small ones. Each generated token is one forward pass through the model; more tokens mean more passes, each doing a small amount of work. CoT succeeds because it decomposes reasoning into budget-sized pieces.
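At the prompt level, the same move looks like this (a generic illustration, not a question from our benchmark):

```python
# One big step: the whole computation must happen "in one go".
direct = "What is 17% of 842? Reply with the number only."

# Budget-sized steps: each token of working-out is another forward pass.
cot = ("What is 17% of 842? Work it out step by step: compute 10% of 842, "
       "then 7% of 842, then add them before stating the final answer.")
```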
Agentic protocols succeed for large models because each step (formulate query, call tool, interpret result) fits within their budget. The same protocol fails for smaller models because the cumulative overhead of managing the protocol exceeds their budget.
Checklists work for human cognition the same way. Atul Gawande’s *The Checklist Manifesto* documents how checklists don’t make pilots smarter; they decompose complex procedures into steps that fit within a human’s cognitive budget under stress. The dual-path architecture is a checklist for LLMs.
The mechanism is always the same: don’t ask the system (human or AI) to do everything at once. Break it into pieces that fit within the system’s processing capacity.
Practical Implications
Decompose to fit, don’t upgrade to cover. When a task fails on a model, the instinct is to switch to a bigger model. The evidence says: restructure the task instead. Haiku + dual-path (95% at 1x cost) matches Opus + single-task (95.5% at 60x cost) within half a point, at one-sixtieth the price. Architecture is leverage; model size is brute force.
Estimate budget per model. Different models have different budgets. Opus handles multi-step agentic protocols with strict instructions. Sonnet handles single-step synthesis with permissive prompts. Haiku handles simple synthesis with minimal formatting overhead. Design your pipeline around the target model’s budget, not the most capable model available.
The merge pattern is general. Any task that can be decomposed into independent subtasks can use the merge pattern: run each subtask with a focused prompt, merge the results. The merge step is trivially cheap for any model because it operates on finished answers, not raw data.
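A sketch of the generic pattern, again assuming a hypothetical `llm(prompt)` helper; each subtask prompt carries a `{q}` placeholder for the question:

```python
from concurrent.futures import ThreadPoolExecutor

def fan_out_merge(question: str, llm, subtask_prompts: list[str]) -> str:
    """Run independent, focused subtasks in parallel; merge finished answers.

    Each prompt in `subtask_prompts` contains a `{q}` placeholder for the
    question. The merge call sees answers, not raw data, so it stays cheap.
    """
    with ThreadPoolExecutor() as pool:
        answers = list(pool.map(llm, [p.format(q=question) for p in subtask_prompts]))
    numbered = "\n\n".join(f"Answer {i + 1}:\n{a}" for i, a in enumerate(answers))
    return llm(
        f"Here are independent answers to the question: {question}\n\n"
        f"{numbered}\n\nCombine them into a single answer, preferring "
        "points on which they agree."
    )
```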
Strict instructions are a regressive tax. A strict prompt instruction costs the same fixed budget regardless of model size. For large models this is negligible. For medium models it can be catastrophic. For small models it may be ignored entirely. Tune prompt strictness to match model capability.
Architecture improvements compound; model upgrades don’t. When you improve the architecture, every model benefits. When you upgrade one model, only that deployment benefits. The dual-path architecture lifted all six models we tested — across three model families, API and local.
The One-Day Arc
The entire research arc — from “these two systems seem close” to “here’s a general principle about LLM cognition validated across six models and 3,853 questions” — happened in a single day. Each finding raised a question that led to the next experiment:
The belief-system-vs-RAG gap seemed small → depth questions exposed the real gap → weaker models failed → the Sonnet anomaly → the cognitive budget hypothesis → the dual-path architecture → validation across all models → confirmation at scale.
No experiment was planned before the previous one’s results came in. The research direction emerged from the data. This itself is a lesson: with the right experimental platform, following each result to its natural next question produces a coherent research arc faster than planning one in advance.
The Bottom Line
The most important decision in an LLM system isn’t which model to use. It’s how to structure the task. A well-decomposed pipeline with a cheap model will consistently outperform an expensive model struggling with a task that exceeds its cognitive budget.
The model provides raw intelligence. The architecture determines how much of that intelligence actually reaches the output.