Humans solved the unreliable-memory problem thousands of years ago. We didn’t evolve better brains. We invented writing.

LLMs have the same problem. They’re brilliant but unreliable — they lose context, forget what they figured out, contradict themselves across sessions, and can’t guarantee they’ll reach the same conclusion twice. The industry response is to make models bigger. I think the answer is simpler: give them clay tablets.

The Problem Isn’t Intelligence

I’ve spent the past year running a multi-agent AI research programme — six agents with distinct roles (researcher, reviewer, verifier, instructor, publisher, principal investigator) working across six repositories. Over that time, the agents produced 90+ research entries, 28 verification scripts, a 200-term glossary, and a full curriculum.

The failure modes weren’t about intelligence. The models could reason, derive, critique, and explain. The failures were about persistence and consistency:

  • CLAUDE.md files went stale. Every single one had outdated information.
  • Beliefs contradicted each other across agents. The PI’s file said a problem was “resolved” while an entry classified it as “falsified.”
  • Context compaction destroyed justifications. After compaction, the model remembered what it believed but not why — and sometimes re-derived different conclusions.
  • Circular reasoning went undetected. Five of six verification tests compared a function to itself. The code was correct; the tests were tautological.

None of these are intelligence problems. They’re bookkeeping problems. The model can think. It just can’t keep reliable records of what it thought.

The Historical Parallel

Human memory has the same properties. It’s associative, creative, excellent at pattern matching — and terrible at exact recall, consistency over time, and detecting its own contradictions.

Humans solved this not by evolving better brains (which haven’t changed in 50,000 years) but by building external systems:

| Human Invention | What It Fixed | LLM Equivalent |
| --- | --- | --- |
| Cuneiform tablets | “What did we agree to last month?” | Timestamped entries (entries/YYYY/MM/DD/) |
| Legal codes | “What are the persistent rules?” | Structured belief registries (beliefs.md) |
| Double-entry bookkeeping | “Do our records contradict each other?” | Contradiction detection (nogoods.md) |
| Scientific journals | “Who claimed what, when, and was it challenged?” | Entries + beliefs + adversarial review agents |
| Libraries with indices | “Where did we write that down?” | Grep over structured markdown |

Every one of these was invented because human memory couldn’t guarantee consistency, persistence, or exact recall. The Sumerians didn’t need smarter scribes. They needed clay tablets.

What the Exact Layer Looks Like

Over the past year, I’ve built a set of tools that function as the “exact layer” for LLM agents:

beliefs — A CLI that maintains a registry of claims with sources, dependency tracking, staleness detection, and contradiction recording. Each belief has an ID, status (IN/OUT/STALE), source file, content hash, and dependency links. When a source file changes, check-stale flags the belief. When beliefs conflict, nogoods records the contradiction.
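The staleness mechanism can be sketched in a few lines: each belief records a content hash of its source file, and the check compares that hash against the file's current contents. This is a minimal illustration of the idea, not the actual internals of the `beliefs` CLI; the `Belief` type and field names are assumptions.

```python
# Minimal sketch of hash-based staleness detection. The Belief dataclass
# and its fields are illustrative, not the real `beliefs` CLI schema.
import hashlib
from dataclasses import dataclass
from pathlib import Path

@dataclass
class Belief:
    id: str
    content: str
    status: str        # "IN", "OUT", or "STALE"
    source: str        # path of the file the belief was derived from
    source_hash: str   # sha256 of the source at the time of recording

def file_hash(path: str) -> str:
    """Hash the current contents of a source file."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def check_stale(belief: Belief) -> Belief:
    """Flag an IN belief as STALE if its source file has changed."""
    if belief.status == "IN" and file_hash(belief.source) != belief.source_hash:
        belief.status = "STALE"
    return belief
```

The key property is that staleness is detected mechanically, with no model in the loop: a changed file flips the flag whether or not anyone remembers the belief exists.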

rms — A Reason Maintenance System (based on Doyle’s 1979 TMS) that goes further: when you retract a belief, all downstream conclusions that depended on it are automatically retracted too. Restore the belief, and the conclusions come back. The system maintains the reasons for each belief, not just the belief itself.

entry — A tool that enforces chronological organization. Every finding goes into entries/YYYY/MM/DD/filename.md. The filesystem encodes time that the model can’t track internally.

checkpoint — Saves working state (current task, beliefs held, open problems, next steps) before context compaction destroys it. The next session reads the checkpoint and picks up where the last one left off.
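In its simplest form this is just state serialized to disk before compaction and reloaded at the start of the next session. A sketch, assuming a JSON format and field names that are illustrative rather than the tool's actual schema:

```python
# Sketch of checkpointing: persist working state across context compaction.
import json
from pathlib import Path

def save_checkpoint(path: str, state: dict) -> None:
    Path(path).write_text(json.dumps(state, indent=2))

def load_checkpoint(path: str) -> dict:
    p = Path(path)
    return json.loads(p.read_text()) if p.exists() else {}
```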

None of these are sophisticated. They’re markdown files, SQLite databases, and CLI commands. That’s the point — they don’t need to be sophisticated. They need to be exact. The sophistication is in the LLM that drives them.

Evidence It Works

Beliefs improve accuracy by as much as a model upgrade

A controlled experiment — 13,200 invocations across 6 models and 4 conditions — measured whether structured beliefs actually help. Result: +6.3 percentage points (pp) for Opus, +8.2pp for Gemini 2.5 Pro, +3.2pp for Sonnet. For context, the accuracy gap between model generations is typically 3-5pp. A markdown file with 237 beliefs produced an improvement comparable to upgrading to a better model.

Cost of the beliefs file: hours of inference time. Cost of a model generation upgrade: billions in training. The exact layer is cheap.

Exact retrieval beats probabilistic retrieval

A head-to-head comparison tested Claude Code with grep over structured markdown against RAG with a vector database on 55 domain questions:

| System | Multiple Choice | Open-Ended |
| --- | --- | --- |
| Claude Code (grep + markdown) | 40/40 (100%) | 77% |
| RAG + Claude (vector DB) | 39/40 (98%) | 69% |
| RAG + Gemini (vector DB) | 36/40 (90%) | 57% |

The largest gaps appeared on questions requiring precise cross-topic lookup. Vector retrieval introduces approximation error at each step; grep introduces none. When answers require chaining multiple lookups, probabilistic errors compound.

The LLM decided what to search for. Grep found it exactly. That’s the two layers working together.
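The division of labor is easy to make concrete: the model chooses the query term, and the search itself is literal string matching with zero approximation error. A minimal sketch of "grep over structured markdown" (the function is illustrative, not part of any of the tools above):

```python
# Sketch of exact retrieval: return every line in the markdown corpus
# that literally contains the query term. No embeddings, no ranking.
from pathlib import Path

def exact_search(root: str, term: str) -> list[tuple[str, int, str]]:
    """Return (file, line_number, line) for every exact match under root."""
    hits = []
    for md in sorted(Path(root).rglob("*.md")):
        for i, line in enumerate(md.read_text().splitlines(), start=1):
            if term in line:
                hits.append((str(md), i, line))
    return hits
```

Either the term is on a line or it isn't; chaining several such lookups compounds no error, which is where the gap with vector retrieval showed up.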

An agent became top contributor by querying its own beliefs

An LLM working on a 150,000-line codebase was given a belief registry of 440 facts about the code. It queried its own beliefs for fragile patterns — substring matching where regex was needed, unvalidated ranges, inconsistent error handling — and found 3 real bugs that human reviewers hadn’t caught. In 8 days it submitted 8 merged MRs and filed 16 issues, reaching the top contributor leaderboard.

The beliefs didn’t make the model smarter. They made its knowledge queryable.

Why Not RAG?

RAG is a probabilistic solution to a bookkeeping problem. It takes your exact documents, converts them to approximate vectors, stores them in a database that returns approximate matches, and hopes the approximation is close enough.

For many use cases, it is. But for belief maintenance — tracking what’s true, what contradicts what, what depends on what, and what changed — approximation is the wrong primitive. You need:

  • Exact status: Is this belief IN or OUT? Not “similar to beliefs that are in.”
  • Dependency tracking: If belief A is retracted, beliefs B, C, and D that depend on it must be re-evaluated. Vector similarity can’t express dependency.
  • Contradiction detection: Beliefs X and Y cannot both be true. This is a logical relationship, not a distance in embedding space.
  • Temporal ordering: Entry from March supersedes entry from January. Timestamps, not similarity scores.
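Contradiction detection in particular is a hard constraint, not a similarity score. A minimal sketch of the nogood idea, assuming a simple representation of pairwise-exclusive beliefs (the function and shape are illustrative, not the actual nogoods.md format):

```python
# Sketch of nogood checking: a nogood records that two beliefs cannot
# both be IN; a violation is a logical fact, not an embedding distance.
from typing import Set, Tuple

def violated_nogoods(
    in_beliefs: Set[str], nogoods: Set[Tuple[str, str]]
) -> Set[Tuple[str, str]]:
    """Return every recorded nogood whose two beliefs are both currently IN."""
    return {(a, b) for (a, b) in nogoods if a in in_beliefs and b in in_beliefs}
```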

The exact layer complements RAG — it doesn’t replace it. Use RAG for discovery (“find documents related to X”). Use the exact layer for commitment (“X is believed because of Y and Z, and if Y is retracted, X must be re-evaluated”).

The Two-Layer Architecture

The pattern that emerged from a year of multi-agent research is simple:

| Layer | What it does | Examples |
| --- | --- | --- |
| Probabilistic (the LLM) | Generate, estimate, explore, associate, decide what to look up | Pattern matching, creative reasoning, choosing search queries |
| Exact (external tools) | Commit, track, retract, verify, persist | Belief registries, entries, contradiction tracking, checkpoints |

An LLM without the exact layer is a brilliant thinker with no notebook. It will have insights, lose them, rediscover them, contradict itself, and never know. The exact layer without the LLM is an empty filing cabinet. Both layers are necessary.

This is the same architecture humans have used for 5,000 years. The probabilistic substrate (the brain) decides what matters. The exact layer (writing, accounting, law, science) makes it reliable. Civilization wasn’t built by improving the substrate — it was built by adding the exact layer on top.

Practical Takeaways

If you’re building LLM applications:

  1. Track what your agent believes, not just what it outputs. A belief registry with dependency links catches contradictions that output-only monitoring misses.

  2. Use exact retrieval for exact questions. Grep over structured markdown beats vector search when precision matters. Use RAG for exploration, exact retrieval for commitment.

  3. Enforce temporal structure externally. Models can’t track time. entries/YYYY/MM/DD/ directories do it for them. When a later finding supersedes an earlier one, the filesystem encodes that relationship.

  4. Save state before context compaction. Checkpoints preserve why the model believes things, not just what it believes. Without the justification, the model may reach different conclusions after compaction.

  5. The exact layer is portable. A beliefs.md file works with any model. Moving between Claude and Gemini costs nothing — read the same file. Compare this to fine-tuning, which locks knowledge into one model’s weights.

The Opportunity

The industry is spending billions making models bigger. The exact layer costs hours of inference time and produces comparable accuracy improvements. It’s portable across models, auditable by humans, and maintains itself through staleness detection and contradiction tracking.

The Sumerians figured this out 5,000 years ago. Intelligence isn’t the bottleneck. Reliable bookkeeping is.

The tools are open source:

  • beliefs — Belief registry with staleness detection
  • rms — Reason Maintenance System with automatic retraction cascades
  • entry — Chronological entry management