LLMs Don’t Need Bigger Models. They Need Clay Tablets. (Revised)

This is a revised edition of the original Clay Tablets post from March 2026. The core argument is unchanged — LLMs need external memory, not bigger parameters. But three months of building 40+ expert knowledge bases across domains produced much stronger evidence, better tools, and a clearer architecture. This revision updates the data while preserving the original thesis.

Humans solved the unreliable-memory problem thousands of years ago. We didn’t evolve better brains. We invented writing.

LLMs have the same problem. They’re brilliant but unreliable — they lose context, forget what they figured out, contradict themselves across sessions, and can’t guarantee they’ll reach the same conclusion twice. The industry response is to make models bigger. I think the answer is simpler: give them clay tablets.

The Problem Isn’t Intelligence

I’ve spent the past year running AI agents across 40+ domains — enterprise products, distributed systems, algorithms, codebases, platform architecture, certification curricula. The agents produced thousands of research entries, discovered hundreds of bugs, and built knowledge bases containing over 30,000 justified beliefs.

The failure modes were never about intelligence. The models could reason, derive, critique, and explain. The failures were about persistence and consistency:

Context compaction destroyed justification chains. After compaction, the model remembered what it believed but not why.
Beliefs contradicted each other across sessions. The same model derived different conclusions from the same evidence on different days.
Bad premises propagated silently. A plausible-but-wrong claim extracted from a source document passed every downstream check because nothing verified it against the source.
The model’s own knowledge went unused. The same model that generated a bug during coding could find that bug during review — if you asked it the right question from the right angle.

None of these are intelligence problems. They’re bookkeeping problems. The model can think. It just can’t keep reliable records of what it thought.

The Historical Parallel

Human memory has the same properties. It’s associative, creative, excellent at pattern matching — and terrible at exact recall, consistency over time, and detecting its own contradictions.

Humans solved this not by evolving better brains (which have changed little in 50,000 years) but by building external systems:

Human Invention	What It Fixed	LLM Equivalent
Cuneiform tablets	“What did we agree to last month?”	Justified beliefs in a SQLite database
Legal codes	“What are the persistent rules?”	Belief networks with retraction cascades
Double-entry bookkeeping	“Do our records contradict each other?”	Contradiction detection (nogoods)
Scientific journals	“Who claimed what, when, and was it challenged?”	Derive-then-review with 13-37% retraction rate
Libraries with indices	“Where did we write that down?”	Semantic search + full-text search over beliefs

Every one of these was invented because human memory couldn’t guarantee consistency, persistence, or exact recall. The Sumerians didn’t need smarter scribes. They needed clay tablets.

What the Exact Layer Looks Like

The tools have matured significantly since the original post. The core engine is now ftl-reasons — a hybrid truth maintenance system with LLM-driven semantics:

reasons — A CLI that maintains a justified belief network in SQLite. Every belief has an ID, truth value (IN/OUT), justification chain, source provenance, derivation depth, and dependency links. When a belief is retracted, all downstream conclusions cascade to OUT automatically. Restore the belief, and the conclusions come back. The system maintains the reasons for each belief, not just the belief itself.

expert-agent-builder — A pipeline that automates knowledge base construction: fetch source documents, extract beliefs, review premises against sources, derive higher-order conclusions, review derivations adversarially, retract errors with cascade propagation.

expert-service — A dual-path retrieval layer: pre-computed beliefs (TMS path) plus full-text source search (FTS path). Any model can query any expert via HTTP.

The belief network is SQLite — portable, inspectable, editable. Not embeddings, not weights, not proprietary formats. Plain text claims with justification chains in a database anyone can read.
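To make the retraction cascade described above concrete, here is a minimal sketch in Python using the standard sqlite3 module. The table names, column names, and helper functions are hypothetical and are not the actual ftl-reasons schema or API. It also simplifies: each belief has a single justification, and every premise must be IN for the belief to be IN.

```python
import sqlite3

# Hypothetical schema for illustration only (not the ftl-reasons schema).
SCHEMA = """
CREATE TABLE beliefs (
    id        TEXT PRIMARY KEY,
    claim     TEXT NOT NULL,
    source    TEXT,                        -- provenance; NULL for derived beliefs
    retracted INTEGER NOT NULL DEFAULT 0   -- 1 = explicitly retracted
);
CREATE TABLE depends_on (
    belief_id  TEXT NOT NULL REFERENCES beliefs(id),
    premise_id TEXT NOT NULL REFERENCES beliefs(id)
);
"""

def open_network():
    """Create an empty in-memory belief network."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    return db

def add_belief(db, belief_id, claim, source=None, premises=()):
    """Store a belief and the premises that justify it."""
    db.execute("INSERT INTO beliefs (id, claim, source) VALUES (?, ?, ?)",
               (belief_id, claim, source))
    db.executemany("INSERT INTO depends_on VALUES (?, ?)",
                   [(belief_id, p) for p in premises])

def set_retracted(db, belief_id, retracted):
    """Retract (True) or restore (False) a single belief."""
    db.execute("UPDATE beliefs SET retracted = ? WHERE id = ?",
               (int(retracted), belief_id))

def status(db, belief_id):
    """Return 'IN' or 'OUT'. A belief is IN only if it is not retracted
    and every premise it depends on is IN (assumes an acyclic network)."""
    (retracted,) = db.execute(
        "SELECT retracted FROM beliefs WHERE id = ?", (belief_id,)).fetchone()
    if retracted:
        return "OUT"
    rows = db.execute(
        "SELECT premise_id FROM depends_on WHERE belief_id = ?",
        (belief_id,)).fetchall()
    for (premise_id,) in rows:
        if status(db, premise_id) == "OUT":
            return "OUT"
    return "IN"

db = open_network()
add_belief(db, "p1", "Writes go to the leader", source="doc-1")
add_belief(db, "p2", "Followers replicate asynchronously", source="doc-1")
add_belief(db, "d1", "A follower read may be stale", premises=["p1", "p2"])
add_belief(db, "d2", "Read-your-writes needs leader reads", premises=["d1"])

set_retracted(db, "p2", True)
print(status(db, "d2"))   # OUT: the retraction cascades through d1
set_retracted(db, "p2", False)
print(status(db, "d2"))   # IN: restoring the premise restores the conclusion
```

Because status is recomputed from the stored justifications rather than cached, cascade and restore need no extra code in this sketch. A production system would more likely store the computed truth value and update it incrementally.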

Evidence It Works

Beliefs improve accuracy more than a model upgrade

A controlled experiment with the same 50 domain questions, the same model (Opus), and the same evaluation rubric measured the impact of a pre-built belief network, referred to below as external epistemic memory (EEM):

Condition	A-grade answers	Average time
Without EEM (raw document search)	33%	350s
With EEM (pre-computed beliefs)	88%	25s

That’s not a marginal improvement: A-grade answers rose by 55 points and average time fell 14x. A $300 knowledge base produced a larger accuracy gain than any model upgrade I’m aware of.

The cheapest Claude model (Haiku) with the belief network came within 4 points of the most expensive (Opus) without it: 94% vs 98% on the same questions. The architecture compresses a model-tier gap into a 4-point difference at 1/60th the per-query cost.

The model’s own knowledge, extracted and verified

The belief network doesn’t just store external documents. It extracts knowledge the model already has but doesn’t reliably surface in a single conversation:

What we provided	What the pipeline extracted
DDIA table of contents	37 implementations, 1,405 beliefs, 30 bugs, 7 architectural rules
LeetCode problem statements	510 solutions, 1,641 beliefs, 17 code-level bugs
Enterprise handbook (724 sources)	4,496 beliefs, 136 actionable blockers

Every solution in the LeetCode set passed LeetCode’s test suite. The pipeline still found 17 real bugs that both the generator and the test suite missed: infinite loops, division by zero, functions named after the wrong problem. The model knew about every one of these issues. It just didn’t activate that knowledge during generation.

The mechanism: each pipeline pass asks a different question. “Write a solution” and “find bugs in this solution” activate different regions of the model’s knowledge distribution. Six passes — generate, explain, extract, derive, review, gate — cover more of what the model knows than any single conversation could.
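As a rough sketch of the six-pass idea, the loop below sends the same artifact through a sequence of differently worded questions. Only the pass names come from this post. The prompt wording, the run_passes function, and the chaining of each pass onto the previous output are assumptions made for illustration, and the model is a stub rather than a real LLM call.

```python
from typing import Callable

# Pass names are from the post; the prompt wording is hypothetical.
PASSES = [
    ("generate", "Write a solution for: {artifact}"),
    ("explain", "Explain step by step how this works: {artifact}"),
    ("extract", "List the individual claims made in: {artifact}"),
    ("derive", "What follows from combining these claims? {artifact}"),
    ("review", "Does each conclusion actually follow? Flag any that do not: {artifact}"),
    ("gate", "Answer ACCEPT or REJECT for: {artifact}"),
]

def run_passes(task: str, model: Callable[[str], str]) -> dict[str, str]:
    """Ask a different question at each pass, feeding each pass the
    previous pass's output. Returns the output of every pass by name."""
    outputs = {}
    artifact = task
    for name, template in PASSES:
        artifact = model(template.format(artifact=artifact))
        outputs[name] = artifact
    return outputs

# Stub model for illustration; replace with a real LLM call.
def echo_model(prompt: str) -> str:
    return f"[response to: {prompt[:40]}]"

results = run_passes("reverse a linked list", echo_model)
print(list(results))
# ['generate', 'explain', 'extract', 'derive', 'review', 'gate']
```

The point of the structure is that the generate prompt and the review prompt never share an objective, so the review pass is not primed to defend what the generate pass produced.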

Beliefs without an LLM

The most surprising result: beliefs are useful to programs that can’t think at all. The dataverse routing system queries the belief network using SQLite FTS5 full-text search — zero LLM involvement. Response time dropped from 10-30 seconds to under 100 milliseconds. Accuracy improved by 8 percentage points. The beliefs are plain text with structure. Any program that can read a database can use them.
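A minimal sketch of that kind of LLM-free lookup, using Python's sqlite3 module and an FTS5 virtual table. The table name, columns, and sample beliefs are hypothetical, and the code assumes the local SQLite build has FTS5 enabled, which is common but not guaranteed.

```python
import sqlite3

db = sqlite3.connect(":memory:")

# Hypothetical layout: belief_id is stored but not indexed for search.
db.execute(
    "CREATE VIRTUAL TABLE beliefs_fts USING fts5(belief_id UNINDEXED, claim)"
)
db.executemany(
    "INSERT INTO beliefs_fts (belief_id, claim) VALUES (?, ?)",
    [
        ("b1", "Retracting a premise cascades to every dependent conclusion"),
        ("b2", "Full-text search over beliefs needs no model call"),
        ("b3", "Compaction drops the justification for a conclusion"),
    ],
)

def search(db, query, limit=5):
    """Return (belief_id, claim) rows that match an FTS5 query,
    best match first according to FTS5's built-in rank."""
    return db.execute(
        "SELECT belief_id, claim FROM beliefs_fts "
        "WHERE beliefs_fts MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    ).fetchall()

for belief_id, claim in search(db, "premise AND cascades"):
    print(belief_id, claim)
# b1 Retracting a premise cascades to every dependent conclusion
```

The query either matches the tokens or it does not, which is the property the routing use case relies on. Nothing here involves embeddings or a model call.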

Self-correction through adversarial review

The derive-then-review cycle flags 13-37% of derived beliefs as invalid per round. The derive step generates new beliefs by asking “what follows from combining these premises?” The review step evaluates them by asking “does this conclusion actually follow?” Each step activates different knowledge from the same model.

The DDIA case made this visible: cascade impact per retraction declined from 4.1 (round 1) to 0.8 (round 3). The high-impact errors surface first because they sit at the base of the deepest reasoning chains. By round 3, the pipeline has covered most of the model’s accessible knowledge for the domain.
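The two numbers tracked per round can be computed from the dependency graph alone. The sketch below assumes one plausible reading of the metrics, which the post does not spell out: retraction rate is retracted beliefs divided by beliefs derived in the round, and cascade impact is the number of additional beliefs that go OUT per retraction. The function names and the example graph are hypothetical.

```python
def downstream(depends_on, retracted):
    """Beliefs that go OUT only because something they rest on was retracted.
    depends_on maps each derived belief to the list of its premises."""
    out = set(retracted)
    changed = True
    while changed:  # repeat until no further belief is knocked out
        changed = False
        for belief, premises in depends_on.items():
            if belief not in out and any(p in out for p in premises):
                out.add(belief)
                changed = True
    return out - set(retracted)

def round_metrics(depends_on, derived_this_round, retracted):
    """Retraction rate and cascade impact for one derive-then-review round."""
    cascaded = downstream(depends_on, retracted)
    return {
        "retraction_rate": len(retracted) / len(derived_this_round),
        "cascade_impact": len(cascaded) / len(retracted) if retracted else 0.0,
    }

# Made-up example graph: p1..p3 are premises, d1..d5 are derived beliefs.
depends_on = {
    "d1": ["p1", "p2"],
    "d2": ["p2", "p3"],
    "d3": ["d1"],
    "d4": ["d1", "d2"],
    "d5": ["d3"],
}
print(round_metrics(depends_on, ["d1", "d2", "d3", "d4", "d5"], {"d1"}))
# {'retraction_rate': 0.2, 'cascade_impact': 3.0}
```

In this toy graph, retracting d1 takes d3, d4, and d5 with it, which is why an error near the base of a chain has a higher impact than one at a leaf.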

Why Not RAG?

RAG is a probabilistic solution to a bookkeeping problem. It takes your exact documents, converts them to approximate vectors, stores them in a database that returns approximate matches, and hopes the approximation is close enough.

For many use cases, it is. But for belief maintenance — tracking what’s true, what contradicts what, what depends on what, and what changed — approximation is the wrong primitive. You need:

Exact status: Is this belief IN or OUT? Not “similar to beliefs that are in.”
Dependency tracking: If belief A is retracted, beliefs B, C, and D that depend on it must be re-evaluated. Vector similarity can’t express dependency.
Contradiction detection: Beliefs X and Y cannot both be true. This is a logical relationship, not a distance in embedding space.
Derivation: New beliefs generated from combinations of existing beliefs. RAG retrieves; EEM reasons.
Self-correction: 13-37% of derived beliefs are caught and retracted per round. RAG has no correction mechanism.
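Contradiction detection in this sense is a set check, not a similarity score. A minimal sketch, assuming a nogood is represented as a set of belief IDs that must not all be IN at once (the function name and the example beliefs are hypothetical):

```python
def violated_nogoods(in_beliefs, nogoods):
    """Return every nogood whose members are all currently IN."""
    current = set(in_beliefs)
    return [set(ng) for ng in nogoods if set(ng) <= current]

# Hypothetical example: a cache cannot be both write-through and write-back.
nogoods = [{"cache-is-write-through", "cache-is-write-back"}]
in_beliefs = {"cache-is-write-through", "cache-is-write-back", "cache-has-ttl"}

for conflict in violated_nogoods(in_beliefs, nogoods):
    print("contradiction among:", sorted(conflict))
# contradiction among: ['cache-is-write-back', 'cache-is-write-through']
```

Two contradictory claims about the same cache would sit very close together in embedding space, so a distance threshold cannot separate "related" from "mutually exclusive". The explicit set can.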

The exact layer complements RAG — it doesn’t replace it. Use RAG for discovery (“find documents related to X”). Use the exact layer for commitment (“X is believed because of Y and Z, and if Y is retracted, X must be re-evaluated”).

Why Not Fine-Tune?

Fine-tuning takes knowledge and compresses it into weights. You lose provenance (can’t ask “where did you learn that?”), reviewability (can’t read what the model knows), retractability (can’t remove one wrong fact without retraining), and portability (knowledge locked to one base model).

Building a belief network costs $10-$300 depending on domain scale. Fine-tuning costs $10K-$100K+. The belief network is usable from the first belief. Fine-tuning is usable after training converges — if it converges.

If you’re going to fine-tune for domain knowledge, you should build the belief network first. The pipeline is a reviewable knowledge curation process: every stage has a quality gate with measured error rates. Export the verified beliefs as training data and you have a reviewed, provenance-tracked training set. But once the verified belief network lifts A-grade answers from 33% to 88%, the question becomes: why fine-tune at all?

Fine-tuning is for skills and behaviors — reasoning patterns, tool use conventions, style adaptation. For knowledge, just use the knowledge base.

The Two-Layer Architecture

The pattern that emerged from a year of multi-agent research:

Layer	What it does	Examples
Probabilistic (the LLM)	Generate, estimate, explore, associate, decide what to look up	Pattern matching, creative reasoning, choosing search queries
Exact (external memory)	Commit, track, retract, verify, persist, derive	Belief networks, justification chains, contradiction tracking, retraction cascades

An LLM without the exact layer is a brilliant thinker with no notebook. It will have insights, lose them, rediscover them, contradict itself, and never know. The exact layer without the LLM is an empty filing cabinet. Both layers are necessary.

This is the same architecture humans have used for 5,000 years. The probabilistic substrate (the brain) decides what matters. The exact layer (writing, accounting, law, science) makes it reliable. Civilization wasn’t built by improving the substrate — it was built by adding the exact layer on top.

Artificial Domain Competence

We call the result Artificial Domain Competence (ADC): what you get when you give a general-purpose model access to a reviewed, justified knowledge base for a specific domain.

You don’t need to wait for AGI. Clone an expert repo from the EEM Hub, point your model at it, and you have domain competence today. With whatever model you already have.

The knowledge is portable across models, reviewable by humans, retractable when wrong, and incrementally updatable. It costs $10-$300 to build and works immediately. The Sumerians figured out the principle 5,000 years ago. We just need to apply it to the models that need it most.

Practical Takeaways

If you’re building LLM applications:

Track what your agent believes, not just what it outputs. A belief network with dependency links catches contradictions that output-only monitoring misses.
Use exact retrieval for exact questions. Full-text search over structured beliefs beats vector search when precision matters. The dataverse system queries beliefs via FTS5 at <100ms with zero LLM calls.
Build the knowledge base before you fine-tune. The EEM pipeline is a reviewable curation process with quality gates at every stage. If you’re going to fine-tune, use the verified beliefs as training data. If the beliefs alone get you to 88%, skip fine-tuning entirely.
Separate generation from critique. The same model that generates a bug can find it during review, but only if you run a separate review pass with a different objective. The 13-37% retraction rate is the share of derived beliefs that a generate-only workflow would have left uncorrected.
The exact layer is portable. A beliefs database works with any model. Moving between Claude and Gemini costs nothing — query the same database. Compare this to fine-tuning, which locks knowledge into one model’s weights.

The Opportunity

The industry is spending billions making models bigger and chasing AGI. External epistemic memory costs $10-$300 per domain and produces larger accuracy improvements than model upgrades. It’s portable across models, auditable by humans, self-correcting through adversarial review, and available today.

The Sumerians figured this out 5,000 years ago. Intelligence isn’t the bottleneck. Reliable bookkeeping is.

The tools are open source:

ftl-reasons — Justified belief networks with retraction cascades
expert-agent-builder — Pipeline to build domain experts from documentation
expert-service — Dual-path retrieval serving beliefs to any model
EEM Hub — Pre-built experts you can clone and use today

This is a revised edition of the original Clay Tablets post from March 2026. The original described the problem and the early tools. This revision adds three months of evidence: 40+ expert knowledge bases, controlled evaluations, the ADC framing, and the EEM Hub for sharing domain competence across the community.

Ben Thomasson