The Expert Agent
A repo is an expert.
Not metaphorically. Literally. When you build up a repository with a CLAUDE.md defining a role, reference materials the agent has read, entries recording what it’s learned, beliefs tracking what it currently holds true, and nogoods recording what failed — you have constructed a domain expert that can be instantiated in seconds.
The agent isn’t an expert because the LLM “knows” the domain. The LLM has read everything and remembers nothing reliably. The agent is an expert because the repo remembers for it. The CLAUDE.md says what the agent cares about. The references say what the field knows. The entries say what this specific agent has discovered. The beliefs say what it currently holds true and why. The nogoods say what it tried and what failed.
Strip away the repo and you have a generalist. Add it back and you have a specialist who can pick up exactly where it left off.
Five Programmes in Four Days
I’ve been running a research institution studying how AI can create original art, music, story, and animation from first principles — no diffusion models, no models trained on human art, pure procedural code directed by LLM cognition.
In four days, five programmes produced 163 entries, registered over 130 beliefs, recorded 9 nogoods, and generated 284 output artifacts — compositions, drawings, stories, animations.
The music PI read 17 books on theory and acoustics, 13 papers on algorithmic composition, and 38 Wikipedia articles on harmony and synthesis. Then it composed. It progressed through 14 phases — from a single sine wave to a 91-second capstone with three-voice counterpoint, nine simultaneous tracks, dynamic shaping, key modulation, and four synthesis methods. 119 entries. 34 beliefs. 114 compositions.
The story PI extracted 71 testable beliefs from 9 craft sources — Aristotle, Propp, McKee, Le Guin, Truby, and others. Then it used those beliefs as hypotheses and wrote 5 stories as experiments. Each story was verified against the belief checklist. A human reader confirmed emotional resonance on at least one.
The art PI completed 5 phases of drawing — value, form, colour, composition, rendering — producing 87 drawings. Phase 5 failed. Not because the agent lacked skill, but because vector graphics and traditional rendering techniques have a fundamental translation gap. Brush strokes don’t work when your medium is Cairo bezier paths.
So the programme closed. Cleanly, with a record of what worked and what didn’t. And the knowledge transferred into a new programme.
When an Expert Knows It’s Wrong
The art programme’s closure is the strongest evidence that the architecture works.
A lesser system would have pushed through — producing increasingly ugly renderings while insisting the approach was viable. Instead, the art PI registered three nogoods (brush stroke density, hue spacing, coordinate translation), identified the root cause (medium mismatch between vector graphics and traditional art techniques), and recommended pivoting to a style that matches the tool’s strengths.
The replacement — a flat illustration programme — inherited 6 beliefs from art, registered 17 of its own and 3 new nogoods in its first 4 entries, and is already producing coherent visual output. The pivot from “this doesn’t work” to “this does work” took less than a day.
This is what a research institution looks like: programmes that can close without the institution losing what they learned.
The Anatomy of an Expert
An expert agent has five layers:
Role definition (CLAUDE.md) — Not “you are a helpful assistant.” Instead: “You are the principal investigator discovering how humans compose music so AI can produce music from first principles. Your verification standard: output must sound musical, not just technically correct.” The tighter the role, the better the expertise.
Reference library (papers, books, articles) — The field’s existing knowledge, downloaded and available. The LLM has seen most of this in training but can’t reliably recall it. Having the actual documents in the repo means the agent can cite specific pages, check its memory against sources, and discover things the training data didn’t emphasise.
Research log (entries) — Timestamped, immutable records of what the agent has done and found. This solves the “every session starts from zero” problem. A new session reads the entries and knows what’s been tried, what worked, what’s open.
Epistemic state (beliefs, nogoods) — What the agent currently holds true and what it’s ruled out. The story PI’s 71 beliefs aren’t decoration — they’re a working checklist used in production. The art PI’s 3 nogoods prevented the Kurzgesagt programme from repeating the same failures. Together they form a truth maintenance system that the LLM doesn’t have natively.
Checkpoint (session state) — What the agent was doing when the last session ended. This bridges the context boundary so the next session picks up mid-thought rather than re-orienting from scratch.
None of these layers require special infrastructure. They’re markdown files in a git repo. The expertise is in the structure, not the technology.
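As a concrete sketch, such a repo might look like this on disk (the programme name, filenames, and entry shown are illustrative, not the actual layout):

```
music-pi/
├── CLAUDE.md          # role definition and verification standard
├── checkpoint.md      # session state at the last context boundary
├── references/        # downloaded papers, books, articles
├── entries/           # timestamped, immutable research log
│   └── 2025-01-14-counterpoint-voicing.md
├── beliefs.md         # current claims, each with a source and source hash
└── nogoods.md         # ruled-out approaches and why they failed
```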
Why This Matters
A human expert takes years to develop. A PhD is 5-7 years of building exactly these layers: learning the field (references), doing original work (entries), developing judgment about what’s true (beliefs), remembering what didn’t work (nogoods), and maintaining continuity across time (memory).
An expert agent takes days. Not because AI is smarter — it isn’t. But because the bottleneck in expertise development is reading and integrating the literature, and AI compresses that from years to hours. The agent still has to do the work. It still has to compose music and listen to whether it sounds right. It still has to write fiction and check whether it resonates. The compression is in knowledge acquisition, not in skill development.
But the skill development is also faster because it’s recorded. A human forgets the details of an experiment from six months ago. The agent has the entry. A human can’t remember exactly why they abandoned an approach. The agent has the nogood. A human’s beliefs drift without them noticing. The agent’s beliefs have source hashes that flag when the justification has changed.
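The source-hash mechanism can be sketched in a few lines. This is an assumed implementation, not the author's actual tooling: a belief records a short hash of the document that justified it, and a staleness check flags the belief when that document changes.

```python
import hashlib
import tempfile
from pathlib import Path

def file_hash(path: Path) -> str:
    """Short content hash used to pin a belief to the exact source that justified it."""
    return hashlib.sha256(path.read_bytes()).hexdigest()[:12]

def belief_is_stale(belief: dict, repo: Path) -> bool:
    """True when the cited source has changed since the belief was registered."""
    return file_hash(repo / belief["source"]) != belief["source_hash"]

# Illustrative repo: register a belief against a reference, then detect drift.
repo = Path(tempfile.mkdtemp())
(repo / "references").mkdir()
ref = repo / "references" / "harmony.md"
ref.write_text("Parallel fifths weaken voice independence.\n")

belief = {
    "claim": "Avoid parallel fifths between outer voices",
    "source": "references/harmony.md",
    "source_hash": file_hash(ref),
}
print(belief_is_stale(belief, repo))  # False: justification unchanged

ref.write_text("Parallel fifths are a stylistic choice, not a rule.\n")
print(belief_is_stale(belief, repo))  # True: the source drifted, so the belief needs review
```

The point is mechanical, not clever: a human notices belief drift only by luck, while a hash comparison notices it every time the file is touched.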
The Institution Model
One expert agent is useful. Many are an institution.
Five programmes in four days — two complete, one closed with transfer, two active. Each is a distinct principal investigator with distinct references, entries, and beliefs. They don’t share context. They share infrastructure: the same entry tool, the same beliefs tool, the same checkpoint protocol.
The institutional model means I can spin up a new expert in any domain by:
- Writing a CLAUDE.md that defines the role and research questions
- Curating a reference library (books, papers, articles)
- Letting the agent start investigating
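The first two steps are mostly scaffolding. A minimal sketch, assuming a hypothetical `scaffold_expert` helper and file layout (none of this is the author's actual tooling):

```python
from pathlib import Path

def scaffold_expert(root: Path, role: str, questions: list[str]) -> Path:
    """Create the skeleton a new principal investigator starts from."""
    root.mkdir(parents=True, exist_ok=True)
    for layer in ("references", "entries"):
        (root / layer).mkdir(exist_ok=True)
    claude = ["# Role", "", role, "", "## Research questions", ""]
    claude += [f"- {q}" for q in questions]
    (root / "CLAUDE.md").write_text("\n".join(claude) + "\n")
    # Epistemic state and checkpoint start empty; the sessions fill them in.
    (root / "beliefs.md").write_text("# Beliefs\n\n(none yet)\n")
    (root / "nogoods.md").write_text("# Nogoods\n\n(none yet)\n")
    (root / "checkpoint.md").write_text("# Checkpoint\n\nNo sessions yet.\n")
    return root

scaffold_expert(
    Path("flat-illustration-pi"),
    role=("You are the principal investigator studying flat illustration "
          "produced procedurally with vector primitives."),
    questions=["What makes a flat composition read as intentional rather than random?"],
)
```

Everything after the scaffold is the agent's work: the references still have to be curated by hand, and the entries only exist once the investigation starts.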
The first session produces foundational entries. The second builds on them. By the fifth session, the agent has a genuine research trajectory — it knows what it’s tried, what worked, what’s open. It can explain its reasoning by pointing to specific entries and beliefs rather than generating plausible-sounding justifications from nothing.
The cost of exploring a dead end collapsed from months to hours. The art programme ran for two days, completed five phases, hit a wall, closed, transferred its knowledge, and the replacement programme is already ahead of where the original was. In a traditional research group, that pivot would take months of meetings, proposals, and restaffing. Here it took an afternoon.
The Uncomfortable Part
An expert agent built this way is better at some things than a human expert and worse at others.
Better: literature coverage, systematic exploration, never forgetting a result, never losing track of what’s been tried, consistency of methodology across sessions, willingness to try things that seem unlikely.
Worse: taste, judgment about what matters, intuition about which paths are worth pursuing, knowing when something “feels wrong” before being able to articulate why.
The current division: the human decides what questions are worth asking and judges whether the results are any good. The agent does everything in between — the reading, the exploring, the composing, the recording, the tracking. This is the same division that works in the physics programme, just applied to creative domains.
The question isn’t whether AI can be an expert. Given the right repo structure, it can — 163 entries, over 130 beliefs, and 284 artifacts in four days prove that. The question is whether the expertise is deep enough to produce work that matters. The music programme produced compositions demonstrating real understanding of harmonic theory. The story programme produced fiction that made a human reader feel something. The art programme knew when to quit.
Whether any of it is art is a different question — one that still requires a human to answer. But the institution doesn’t need every output to be art. It needs the process to be repeatable, the failures to be informative, and the knowledge to accumulate. On those metrics, it’s working.