Classical AI Solved Your LLM’s Problems in 1979 (Revised)


This is a revised edition of the original post from February 2026. The original identified five failure modes in multi-agent LLM systems and mapped them to classical AI frameworks. This revision adds three months of empirical data — 40+ expert knowledge bases, controlled evaluations, and measured error rates — that validate the classical predictions with hard numbers.

Every failure mode I’ve documented in this series — stale beliefs, contradictory agents, cascading hallucinations, lost justifications — was identified and formalized by classical AI researchers decades ago. They built systems to address these problems. They published the theory. And almost nobody building multi-agent LLM systems seems to know about any of it.

The difference from three months ago: we’ve now built the implementations and measured them. The classical frameworks aren’t just theoretically applicable — they produce specific, quantified improvements when applied to LLMs.

The Five Failure Modes

| Failure | What Happens | Classical Framework | Empirical Measurement |
| --- | --- | --- | --- |
| Staleness | Role definitions contain outdated claims | Frame problem (McCarthy & Hayes, 1969) | 100% of role files stale in 6-agent audit |
| Error propagation | Wrong value spreads through dependent beliefs | Dependency-directed backtracking (Stallman & Sussman, 1977) | Cascade impact: 4.1 beliefs per retraction (round 1), declining to 0.8 (round 3) |
| Circular verification | Tests verify claims against the claims themselves | Odd-loop detection (Doyle, 1979) | 5 of 6 verification tests were tautological |
| Cross-agent divergence | Agents hold contradictory beliefs, neither notices | The merge problem (AGM, 1985) | 136 actionable blockers surfaced in a single 12,731-belief network |
| Fabricated evidence | Agent adds plausible details not in the source | Partially addressed by TMS source tracking | 8% of premises contain fabricated specificity |

The original post listed “hallucinated evidence” as having no classical analogue. That was partially wrong. Classical TMS tracks justification sources — every belief points to its evidence. The LLM-specific failure is subtler than wholesale fabrication: the model doesn’t invent claims that contradict the source. It adds plausible details the source never mentioned. Redis as a storage backend when the source doesn’t specify one. JIRA review processes when the source says Slack channel. The fabrication is embellishment, not contradiction, and it passes every coherence check downstream.

The Frameworks, With Data

Truth Maintenance Systems (Doyle, 1979)

The theory: Track why you believe things. Every belief has a justification — a record of what evidence supports it. When evidence changes, propagate the change through every belief that depends on it.

The implementation: `ftl-reasons`, a hybrid TMS with support-list (SL) justifications, breadth-first (BFS) propagation cascades, and LLM-driven semantics, backed by SQLite. The LLM fills the “problem solver” role that Doyle left as an abstract interface: it generates new beliefs via `derive` and evaluates them via `review-beliefs`.

The measurement: Across 40+ expert knowledge bases, the derive-then-review cycle catches 13-37% of derived beliefs as invalid per round. This is a direct measurement of what TMS was designed to handle: the gap between what a reasoner generates and what survives justification checking.

The retraction cascade works exactly as Doyle described. In the DDIA expert (distributed systems), retracting a single delete-before-rename bug cascaded through 31 derived beliefs. The cascade impact declined across rounds — 4.1 per retraction in round 1, 3.9 in round 2, 0.8 in round 3 — because the high-impact errors sit at the base of the deepest reasoning chains and get caught first.
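A minimal sketch of a breadth-first retraction cascade, to make the mechanism concrete. The network, the belief names, and the `retract` function are hypothetical illustrations, not the ftl-reasons code. Each belief lists its justifications, and a belief stays in only while at least one justification has all of its antecedents still believed.

```python
from collections import deque

# Hypothetical belief network: each derived belief maps to its
# justifications; a justification is the set of antecedents that must
# all be believed. Premises (A, B) have no entry.
justifications = {
    "C": [{"A"}],
    "D": [{"C"}, {"B"}],  # D has two independent supports
    "E": [{"C"}],
    "F": [{"E"}],
}

def retract(belief, believed, justifications):
    """Retract `belief`, cascade breadth-first, and return everything that went out."""
    # Invert the graph once: antecedent -> beliefs that cite it.
    dependents = {}
    for derived, justs in justifications.items():
        for just in justs:
            for antecedent in just:
                dependents.setdefault(antecedent, set()).add(derived)

    believed.discard(belief)
    out = [belief]
    queue = deque([belief])
    while queue:
        current = queue.popleft()
        for dep in sorted(dependents.get(current, ())):
            if dep not in believed:
                continue
            # A dependent survives if any justification is still fully believed.
            if any(just <= believed for just in justifications[dep]):
                continue
            believed.discard(dep)
            out.append(dep)
            queue.append(dep)
    return out

believed = {"A", "B", "C", "D", "E", "F"}
cascade = retract("A", believed, justifications)
print(cascade)  # ['A', 'C', 'E', 'F']; D survives through its second support
```

The sketch assumes an acyclic network. Beliefs that support each other in a loop need the well-foundedness checks Doyle describes, which are omitted here.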

The depth-8 ceiling is a structural finding Doyle’s framework predicts but didn’t measure: beliefs derived beyond 8 levels of reasoning don’t survive review on any reasoning substrate. At depth 0, retraction rate is near zero. By depth 9+, it’s 100%. The justification chains get too long for the reasoner — human or LLM — to evaluate reliably.
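One way to make the depth measure concrete, as a sketch under assumptions: premises sit at depth 0 and a derived belief sits one level below its deepest antecedent. That definition, the `antecedents` mapping, and the helper names are hypothetical; the post does not spell out how depth is computed.

```python
from functools import lru_cache

# Hypothetical network: belief -> antecedents of its justification.
antecedents = {
    "p1": (),
    "p2": (),
    "d1": ("p1", "p2"),
    "d2": ("d1",),
    "d3": ("d2", "p1"),
}

MAX_DEPTH = 8  # the ceiling reported in the text

@lru_cache(maxsize=None)
def depth(belief):
    """Premises are depth 0; a derived belief is one deeper than its deepest antecedent."""
    parents = antecedents[belief]
    return 0 if not parents else 1 + max(depth(p) for p in parents)

def too_deep(belief):
    """Flag beliefs whose derivation chain exceeds the ceiling."""
    return depth(belief) > MAX_DEPTH

print(depth("d3"))  # 3
```

A derivation step could call a guard like `too_deep` before extending a chain, rather than generating beliefs that are expected to fail review.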

Assumption-Based TMS (de Kleer, 1986)

The theory: Label every belief with its assumptions. Maintain a database of contradictions (nogoods) — sets of assumptions that can’t all be true simultaneously. When a new contradiction is found, record it permanently so it’s never rediscovered.

The implementation: `reasons nogood A B` records contradictions with entrenchment-scored dependency-directed backtracking. The system examines both justification chains, finds the least-entrenched assumption, and retracts it with cascade propagation.
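The nogood database itself can be sketched in a few lines. The `NogoodDB` class below is a hypothetical illustration of de Kleer's idea, not the `reasons` implementation: it stores minimal sets of assumptions that cannot hold together, and any environment containing one of them is inconsistent.

```python
class NogoodDB:
    """Minimal nogood store: remembers assumption sets that cannot all hold."""

    def __init__(self):
        self._nogoods = set()

    def record(self, *assumptions):
        """Record a contradiction; return False if a smaller nogood already covers it."""
        new = frozenset(assumptions)
        if any(existing <= new for existing in self._nogoods):
            return False
        # Drop recorded supersets that the new, smaller nogood subsumes.
        self._nogoods = {ng for ng in self._nogoods if not new <= ng}
        self._nogoods.add(new)
        return True

    def is_consistent(self, environment):
        """An environment is consistent if it contains no recorded nogood."""
        env = frozenset(environment)
        return not any(ng <= env for ng in self._nogoods)

db = NogoodDB()
ready = "deployment architecture is production-ready"
tls_off = "TLS verification is disabled in health checks"
db.record(ready, tls_off)

print(db.is_consistent({ready}))           # True
print(db.is_consistent({ready, tls_off}))  # False
```

Because the contradiction is stored once, any later set of beliefs that contains both statements is rejected without rediscovering why.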

The measurement: In the redhat-expert knowledge base (12,731 beliefs across 6 departments), gated belief analysis surfaced 136 actionable blockers — problems no one asked about, emerging from the structure of knowledge itself. These are concrete observations that block general conclusions: “We believe the deployment architecture is production-ready” is blocked by “TLS verification is disabled in health checks.” That’s not a vague concern — it’s a specific claim contradicting a specific conclusion, exactly the nogood mechanism de Kleer described.

The AWX expert (codebase architecture, ~500 beliefs) independently found 30 code-level gatekeepers using the same mechanism. Same pattern, different domain.

AGM Belief Revision (Alchourrón, Gärdenfors, & Makinson, 1985)

The theory: When new information contradicts existing beliefs, you can’t keep both. Epistemic entrenchment — a priority ordering over beliefs — determines which ones survive conflict. More entrenched beliefs are harder to retract.

The implementation: Entrenchment scoring in the backtracking algorithm. When a nogood is detected, the system computes entrenchment from source type (observation > derivation > speculation), derivation depth (shallower = more entrenched), and dependent count (more dependents = more entrenched). The least-entrenched belief in the conflicting set gets retracted.
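A sketch of how those three signals could combine into an ordering. The lexicographic weighting (source type first, then depth, then dependent count), the `Belief` fields, and the example beliefs are assumptions made for illustration; the text names the inputs but not the formula.

```python
from dataclasses import dataclass

# Assumed ranking, following the ordering given in the text.
SOURCE_RANK = {"observation": 2, "derivation": 1, "speculation": 0}

@dataclass(frozen=True)
class Belief:
    name: str
    source: str      # "observation", "derivation", or "speculation"
    depth: int       # derivation depth; 0 for premises
    dependents: int  # number of beliefs that rely on this one

def entrenchment(belief):
    """Sort key: a higher tuple means a more entrenched belief."""
    return (SOURCE_RANK[belief.source], -belief.depth, belief.dependents)

def pick_retraction(conflict):
    """Return the least-entrenched belief in a conflicting set."""
    return min(conflict, key=entrenchment)

conflict = [
    Belief("tls-disabled-in-health-checks", "observation", 0, 3),
    Belief("deployment-is-production-ready", "derivation", 4, 1),
]
victim = pick_retraction(conflict)
print(victim.name)  # deployment-is-production-ready
```

Here the concrete observation outranks the derived conclusion, so the conclusion is the one retracted.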

The measurement: The dirty pipeline simulation quantifies AGM’s value. An 85% accurate model across 5 reasoning stages compounds to 44.4% end-to-end accuracy. With the derive-then-review architecture (generate, then filter using entrenchment-guided retraction), the same 85% model converges to 98.2% in 5 rounds. The 13-37% retraction rate per round is the filter. Entrenchment scoring ensures the retraction hits the right beliefs — the weakly-justified ones, not the well-grounded ones.
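The 44.4% figure is plain compounding, assuming each stage succeeds independently with the same probability. The sketch below reproduces only that baseline; it does not model the review rounds behind the 98.2% result.

```python
def end_to_end_accuracy(per_stage, stages):
    """Probability that every stage is correct, assuming independent stages."""
    return per_stage ** stages

print(f"{end_to_end_accuracy(0.85, 5):.1%}")  # 44.4%
```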

The Frame Problem (McCarthy & Hayes, 1969)

The theory: How does a reasoning system know what stays the same when something changes? The sleeping dog strategy — assume everything persists unless explicitly changed.

The implementation: `reasons check-stale` hashes each source file and compares the result with the hash recorded when its beliefs were extracted. When a source document changes, every belief extracted from it is flagged for re-evaluation. The sleeping dogs get woken.
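A minimal sketch of hash-based staleness detection. The function names and the two mappings are hypothetical, and SHA-256 is an assumed choice of hash; this shows the idea behind the check, not the actual `check-stale` code.

```python
import hashlib
from pathlib import Path

def file_digest(path):
    """SHA-256 of the file's current contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def find_stale(beliefs, recorded):
    """Return ids of beliefs whose source file no longer matches its recorded digest.

    beliefs:  belief id -> path of the source it was extracted from
    recorded: source path -> digest stored at extraction time
    """
    current = {}
    stale = []
    for belief_id, source in beliefs.items():
        if source not in current:
            current[source] = file_digest(source)  # hash each file once
        if current[source] != recorded.get(source):
            stale.append(belief_id)
    return stale
```

A source with no recorded digest is treated as stale, which errs on the side of re-evaluation. A deleted source file raises `FileNotFoundError` in this sketch; a real tool would need to handle that case explicitly.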

The measurement: In a 6-agent research programme, 100% of role definition files had staleness issues. Every single one contained outdated claims that the agents were treating as current. The frame problem isn’t theoretical — it’s the default failure mode of any multi-session LLM system.

The belief network partially dissolves the frame problem by making beliefs external to context. When context compacts (the sawtooth: 50-88% information loss per compaction event), the beliefs survive in `reasons.db`. The model forgets its justification chains. The database doesn’t.
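To illustrate what “external to context” means in practice, here is a sketch of a belief store in SQLite. The two-table schema, the column names, and the `why` helper are assumptions for illustration and are not the real `reasons.db` layout.

```python
import sqlite3

# In-memory for the example; a persistent store would pass a file path.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE beliefs (
    id     TEXT PRIMARY KEY,
    text   TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'IN'
);
CREATE TABLE justifications (
    belief_id     TEXT NOT NULL REFERENCES beliefs(id),
    antecedent_id TEXT NOT NULL REFERENCES beliefs(id)
);
""")
conn.executemany(
    "INSERT INTO beliefs (id, text) VALUES (?, ?)",
    [
        ("b1", "TLS verification is disabled in health checks"),
        ("b2", "Health checks do not validate certificates"),
    ],
)
conn.execute("INSERT INTO justifications VALUES (?, ?)", ("b2", "b1"))
conn.commit()

def why(conn, belief_id):
    """Return the antecedents recorded as supporting a belief."""
    rows = conn.execute(
        "SELECT antecedent_id FROM justifications "
        "WHERE belief_id = ? ORDER BY antecedent_id",
        (belief_id,),
    )
    return [row[0] for row in rows]

print(why(conn, "b2"))  # ['b1']
```

After a compaction event the model can ask the store why a belief is held, instead of relying on whatever justification survived in its context window.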

The Genuinely Novel Problem: Fabricated Specificity

The original post identified “hallucinated evidence” as the one failure mode with no classical analogue. Three months of measurement refined this.

Classical TMS assumes justifications point to real things. LLMs can fabricate evidence — but the dominant failure mode isn’t wholesale fabrication. It’s fabricated specificity: the model extracts a claim from a source document and embellishes it with plausible details the source never mentioned.

We measured this directly in Phase 2 of the propose-beliefs accuracy experiment (handbook-expert, 100 sampled premises evaluated against their source documents):

| Error Type | Count | Description |
| --- | --- | --- |
| Misread source | 7 | Added plausible details not in the source |
| Overgeneralized | 2 | Inferred rationale the source didn’t state |
| Factually wrong | 0 | No claims that directly contradict the source |

92% precision overall. Zero factually-wrong errors. The model doesn’t contradict its sources — it adds to them. “Token rotation with reuse detection” becomes “refresh tokens stored in Redis.” “Classification via Slack channel” becomes “must review and comment on JIRA.” The fabrications are plausible, internally consistent, and pass every coherence check downstream.

This is the “coherence isn’t correctness” gap. Classical TMS can verify that a conclusion follows from its premises (logical validity). It cannot verify that the premises themselves accurately reflect the source material (factual accuracy). The fix is a premise-review step — checking each extracted belief against its source document before it enters the network. We’ve filed this as the next tool improvement.
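As a rough illustration of what a premise-review step looks for, the sketch below flags capitalized terms in an extracted belief that never appear in the source. This is a crude lexical stand-in written for this example, not the planned tool: a real review would most likely need an LLM or a human to compare meaning, and the function name and sample strings are hypothetical.

```python
import re

def unsupported_terms(belief, source):
    """Return capitalized terms in the belief that do not occur in the source."""
    source_words = set(re.findall(r"[a-z0-9]+", source.lower()))
    flagged = []
    for match in re.finditer(r"\b[A-Z][A-Za-z0-9]*\b", belief):
        if match.start() == 0:
            continue  # a sentence-initial capital carries no signal
        term = match.group()
        if term.lower() not in source_words and term not in flagged:
            flagged.append(term)
    return flagged

source = "Refresh tokens use token rotation with reuse detection."
belief = "Refresh tokens are stored in Redis and use token rotation with reuse detection."
print(unsupported_terms(belief, source))  # ['Redis']
```

The check catches named technologies the source never mentions, such as the Redis case, but it misses embellishments phrased in ordinary lowercase words. It is a cheap first filter at best.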

Six Passes Extract What Single Conversations Miss

The strongest validation of the classical approach came from an unexpected direction: using the TMS pipeline to extract knowledge the model already has but doesn’t surface in a single conversation.

We gave the pipeline minimal seeds — a table of contents, a set of problem statements — and ran six passes:

1. Generate: produce solutions (all passed their test suites)
2. Explain: examine each solution (found edge cases the generator missed)
3. Extract beliefs: synthesize cross-cutting patterns (1,641 beliefs from 510 solutions)
4. Derive: build reasoning chains (113 derived beliefs across 8 depth levels)
5. Review: evaluate derivations adversarially (28 invalid, 29 insufficient, 15 unnecessary)
6. Gate: identify concrete bugs blocking general claims (17 real bugs in LeetCode, 30 in DDIA)

Same model throughout. Each pass activated knowledge the previous ones didn’t. The pipeline’s value is in the structured iteration — exactly the generate-then-critique loop that TMS was designed to support.

The model has the knowledge. The classical architecture determines how much of it you extract.

Why the Classical Solutions Didn’t Port Directly — But the Principles Did

The classical frameworks assume formal logic. Beliefs are propositions. Justifications are logical derivations. Contradictions are formal inconsistencies. LLMs work in natural language. Their beliefs are sentences. Their justifications are semantic, not logical.

You can’t run Doyle’s original TMS on natural language. You can’t compute AGM entrenchment over paragraphs of text. The formal machinery doesn’t apply.

But the principles apply perfectly — and when implemented with LLMs filling the problem-solver role, they produce measurable results:

| Principle | Classical Source | LLM Implementation | Measured Effect |
| --- | --- | --- | --- |
| Track justifications | TMS (Doyle, 1979) | `reasons.db` with SL justifications | Beliefs survive compaction; errors traceable |
| Propagate retractions | TMS cascade | BFS propagation in ftl-reasons | 4.1 → 0.8 cascade impact across rounds |
| Record contradictions | ATMS nogoods (de Kleer, 1986) | `reasons nogood` | 136 blockers surfaced in 12,731-belief network |
| Entrenchment scoring | AGM (1985) | Depth + source type + dependents | 85% model → 98.2% via guided retraction |
| Detect staleness | Frame problem (McCarthy & Hayes, 1969) | `check-stale` with source hashing | 100% staleness in unmonitored files |
| Separate generation from critique | Generate-then-test (all classical AI) | Derive-then-review | 13-37% retraction rate proves complementary knowledge activation |

The tools I built are practical approximations of these classical frameworks. They use natural language instead of formal logic. They use LLMs instead of theorem provers. They use SQLite instead of RETE networks. They’re cruder than the classical systems in formalism but far more capable in scope — because LLMs can reason about natural language claims that formal systems never could.

The Takeaway

If you’re building multi-agent LLM systems and hitting problems with stale beliefs, contradictory agents, cascading errors, or lost justifications, the theory exists. It was published between 1969 and 1986. The implementations exist too. And now the measurements exist: 88% vs 33% accuracy, 13-37% self-correction rate, cascade propagation that catches errors across 8 derivation depths.

Read the originals if you’re interested:

Doyle, J. (1979). “A Truth Maintenance System”
de Kleer, J. (1986). “An Assumption-based TMS”
Alchourrón, C. E., Gärdenfors, P., & Makinson, D. (1985). “On the Logic of Theory Change”
McCarthy, J., & Hayes, P. J. (1969). “Some Philosophical Problems from the Standpoint of Artificial Intelligence”

Or just clone an expert and start using it:

```shell
# Clone a pre-built domain expert
git clone https://github.com/eem-hub/ddia-expert

# Or build your own from documentation
expert-build init my-domain --domain "My Domain"
expert-build fetch-docs https://docs.example.com/
expert-build summarize
expert-build propose-beliefs
expert-build accept-beliefs
```
The classical AI researchers solved these problems decades ago. LLMs are the problem solver they were waiting for. The combination works.


*Previously: Clay Tablets (Revised). Tools: ftl-reasons, expert-agent-builder, EEM Hub*


Ben Thomasson