
I ran 23,100 LLM invocations across three controlled experiments to answer a simple question: does giving an AI agent a knowledge base make it more accurate?

The answer is yes — but I couldn’t have known that from a single test run. And along the way, half of my engineering intuitions turned out to be wrong.

The Problem With Intuition

Here’s what I thought would help before I measured:

  1. Confidence gating — Ask the model how confident it is. If low confidence, escalate. Seemed obviously useful.
  2. Expert prompts — Tell the model “you are an expert in X.” Standard industry practice.
  3. Structured beliefs — Give the model a knowledge base of 237 domain facts. Felt like it should help.
  4. Beliefs + expert prompt — Combine both. Should be the best of both worlds.

Here’s what the experiments showed, across 23,100 invocations:

  • Confidence gating. Intuition: should help. Reality: harmful. Confidence correlates weakly with accuracy (r = 0.14–0.28), and asking the model to revise based on confidence always reduced accuracy, by 3 to 41 percentage points.
  • Expert prompt. Intuition: should help. Reality: neutral. No statistically significant effect on its own.
  • Structured beliefs. Intuition: should help. Reality: confirmed. +3–8pp on domains the model knows; +15–35pp on domains it doesn’t.
  • Beliefs + expert prompt. Intuition: best combination. Reality: worse than beliefs alone. The expert prompt causes the model to over-trust its own knowledge and under-use the belief tools.

Two out of four were wrong. One was actively harmful. The only way I discovered this was by running controlled experiments with enough repetitions to get statistical significance.
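The weak confidence–accuracy link is cheap to check in your own system. A minimal sketch (the helper and the records below are illustrative, not the experiments' data): log each answer's self-reported confidence alongside whether it was correct, then compute the correlation.

```python
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation; with a 0/1 `ys` this is the point-biserial r."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Illustrative log: (self-reported confidence, was the answer correct?)
records = [(0.9, 1), (0.9, 0), (0.8, 1), (0.95, 0), (0.7, 1),
           (0.6, 0), (0.85, 1), (0.9, 0), (0.75, 1), (0.5, 0)]
r = pearson_r([c for c, _ in records], [ok for _, ok in records])
print(f"confidence-accuracy r = {r:.2f}")
```

If r lands in the 0.1–0.3 range, gating decisions on confidence is gating on noise.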

Why Single Runs Are Meaningless

Traditional software is deterministic. You call a function, you get a result. One test is enough.

LLMs are stochastic. The same prompt produces different outputs every time. A design decision that looks great on one run might fail 40% of the time across 100 runs.

My experiments used 10 runs per condition per model. Here’s why that matters:

Sonnet’s accuracy on python agents questions:

  • Without beliefs: 62% (±22.5% across runs)
  • With beliefs: 96.5% (±2.3% across runs)

That ±22.5% spread means any single run could show Sonnet scoring anywhere from roughly 40% to 84% without beliefs. One lucky run would make it look like beliefs don’t help; one unlucky run would look catastrophic. Only repeated runs (here, 10 per condition) reveal the true distribution.
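You can see the run-count effect with a small simulation. A sketch, assuming each run is 40 independent questions and using made-up true accuracies that echo the numbers above:

```python
import random
from statistics import mean, stdev

random.seed(0)

def one_run(true_acc, n_questions=40):
    """One simulated evaluation run: each question is an independent Bernoulli draw."""
    return mean(random.random() < true_acc for _ in range(n_questions))

# Made-up true accuracies echoing the ~62% vs ~96.5% numbers above
for label, true_acc in [("without beliefs", 0.62), ("with beliefs", 0.965)]:
    runs = [one_run(true_acc) for _ in range(10)]
    print(f"{label}: first run {runs[0]:.0%}, "
          f"10-run mean {mean(runs):.1%} (stdev {stdev(runs):.1%})")
```

The single-run number bounces around; the 10-run mean sits near the true value, and the stdev tells you how much any one run can mislead.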

Confounds Are Invisible Without Controls

My first experiment tested beliefs on Red Hat Ansible Automation Platform (AAP) documentation — a public domain the models were trained on. Result: beliefs helped 3 of 6 models by +3–8 percentage points. Decent but modest.

My second experiment tested the same beliefs tool on an internal codebase (python agents) — a private project the models couldn’t have seen during training. Result: beliefs helped by +15–35 percentage points.

Same tool. Essentially the same experimental design. A 10x larger effect.

The difference: on AAP, the models already knew most of the answers from training. The beliefs were competing with parametric knowledge. On python agents, the models had zero prior knowledge. The beliefs were the only source of domain information.

If I’d stopped at the first experiment, I’d have concluded “beliefs are a modest improvement.” The second experiment showed “beliefs are transformative when the model doesn’t already know the domain.” The confound — parametric knowledge — was completely invisible in the first experiment.

Effects Are Model-Dependent

The same design decision can help one model and hurt another:

  • Gemini 2.5 Flash + beliefs on AAP: −7.8pp (worse with beliefs)
  • Gemini 2.5 Flash + beliefs on python agents: +15.0pp (better with beliefs)

What happened? On AAP, only the beliefs conditions had tool access. Flash has a tool-deference failure mode — when tools return nothing relevant, it refuses to answer from its own knowledge (15.8% refusal rate). Giving it tools on a domain it already knew actually made it worse.

On python agents, both conditions had equal tool access. The tool-deference problem disappeared. Flash benefited normally.

A design decision that “breaks Flash” was actually a confound in the experimental design. Without testing across models with proper controls, I’d have blacklisted Flash from belief systems entirely — the wrong conclusion.

The Parallel to Other Fields

This transition from intuition-based to experiment-based engineering has happened before:

Medicine went from “this herb seems to work” to randomized controlled trials. The transition took centuries and was resisted at every step. But once you accept that the human body is stochastic — same treatment, different patients, different outcomes — controlled experiments become the only way to make reliable treatment decisions.

Agriculture went from “this field produced more” to Fisher’s experimental design in the 1920s. Fisher invented randomization and blocking because agricultural yields are stochastic — same soil, same seeds, different weather, different results. Single-field observations are unreliable.

LLM engineering is at the same inflection point. The infrastructure is stochastic. Single-run evaluations are unreliable. The same prompt with the same model produces different outputs. The only way to make reliable design decisions is controlled experiments with enough repetitions.

The Cost Is Surprisingly Low

Running 23,100 invocations sounds expensive. It’s not.

Each invocation costs a few cents (Sonnet) to tens of cents (Opus). The python agents ablation — 6,600 invocations, 6 models, 2 conditions — cost less than a day of an engineer’s salary. And it saved us from deploying confidence gating (harmful), expert prompts (useless), and a beliefs+expert combination (counterproductive).
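The arithmetic is worth making explicit. A sketch with placeholder per-invocation prices (the dollar figures are assumptions for illustration, not the actual bill; 6 models × 2 conditions × 10 runs implies roughly 55 questions per run):

```python
# Placeholder average cost per invocation in USD (assumed, not measured)
COST_PER_CALL_USD = {"sonnet-class": 0.03, "opus-class": 0.20}

def ablation_cost(n_models, n_conditions, n_runs, n_questions, cost_per_call):
    """Total invocations and rough dollar cost for a full ablation grid."""
    invocations = n_models * n_conditions * n_runs * n_questions
    return invocations, invocations * cost_per_call

calls, usd = ablation_cost(6, 2, 10, 55, COST_PER_CALL_USD["sonnet-class"])
print(f"{calls:,} invocations ~= ${usd:,.0f} at Sonnet-class pricing")
```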

The cost of not measuring is higher: deploying harmful design decisions, optimizing the wrong things, making model selection choices based on single-run anecdotes.

Execution Is Automated. Design Is the Hard Part.

The good news: experiment execution is fully automatable. Our pipeline — runner, scorer, extraction validator — ran 23,100 invocations unattended. You design the experiment, press go, and wait for results.

AI helps with the design too. The second experiment (python agents) was designed by an AI agent that identified the confounds in the first experiment (tool-access asymmetry, parametric knowledge) and proposed fixes (equal tools for both conditions, internal domain with zero prior knowledge). The human approved the design and hit run.

We learned how to do experimental studies by doing them, not by waiting until we had a perfect design:

  • The first experiment tested a known domain. This understated the effect by 10x — but we didn’t know that until we ran it. The result told us what the confound was, which informed the second experiment.
  • Unequal tool access between conditions created an artifact. We discovered this by analyzing the results, not by predicting it in advance.
  • Answer extraction bugs appeared three times across different model families. Each new model formats answers differently. The extraction validator caught these — and each one taught us what to validate next time.

Each experiment improved the next one’s design. The cycle is: design → run → analyze → learn → redesign → run again. The cost of running an imperfect experiment is days of compute. The cost of waiting for a perfect design is never running anything at all.

The bottleneck is experimental design quality, not execution capacity. But design quality improves fastest by doing experiments, not by theorizing about them. Token costs are falling. Compute is abundant. The scarce resource is the ability to design experiments that actually test what you think they test — and that ability comes from running experiments, not from reading about them.

Practical Takeaways

If you’re building LLM systems:

  1. Don’t trust single-run evaluations. Run every design decision 10+ times per condition before committing.

  2. A/B test, don’t just ship. Treat every design choice (prompts, tools, retrieval strategies) as a hypothesis. Test it with controls.

  3. Budget for measurement infrastructure. You need an automated runner, a scorer, and an extraction validator. These are your instruments; without them, you can’t do science.

  4. Control for confounds. If your treatment and baseline differ on more than one dimension, your results are ambiguous. My first experiment had a tool-access confound that masked the real effect size by 10x.

  5. Report effect sizes, not just “it works.” +3pp and +35pp are both statistically significant. Only one justifies re-architecting your system.

  6. Test across models. Effects are model-dependent. What helps Opus may hurt Flash. What helps on public domains may not help on private ones.

  7. Expect half your intuitions to be wrong. This isn’t a personal failing — it’s the nature of stochastic systems. The only remedy is measurement.
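Takeaways 1, 2, and 5 compose into a single comparison helper. A sketch using a two-sided permutation test on per-run accuracies (the run data below is invented for illustration):

```python
import random
from statistics import mean

def compare_conditions(baseline_runs, treatment_runs, n_perm=10_000, seed=0):
    """Effect size (in pp) plus a permutation-test p-value for the difference in means."""
    rng = random.Random(seed)
    observed = mean(treatment_runs) - mean(baseline_runs)
    pooled = list(baseline_runs) + list(treatment_runs)
    k = len(baseline_runs)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = mean(pooled[k:]) - mean(pooled[:k])
        if abs(diff) >= abs(observed):
            hits += 1
    return observed * 100, hits / n_perm  # effect in percentage points, p-value

# Invented per-run accuracies (10 runs per condition, as in the experiments)
baseline  = [0.40, 0.55, 0.62, 0.71, 0.48, 0.80, 0.60, 0.66, 0.52, 0.84]
treatment = [0.95, 0.97, 0.94, 0.98, 0.96, 0.99, 0.95, 0.97, 0.96, 0.98]
effect_pp, p = compare_conditions(baseline, treatment)
print(f"effect = {effect_pp:+.1f}pp, p = {p:.4f}")
```

Reporting both numbers together keeps the two failure modes apart: a significant but tiny effect, and a large but noisy one.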

The Meta-Point

Software engineering succeeded because its infrastructure is deterministic. Write a test, run it once, trust the result. The entire methodology — TDD, CI/CD, code review — assumes determinism.

LLM engineering will succeed when it adopts the methodology that matches its infrastructure: repeated experiments, statistical analysis, confound control, effect sizes. This is what every field that works with stochastic systems has learned, from medicine to agriculture to psychology.

The tools exist. The cost is low. The alternative — shipping based on intuition and anecdote — has a 50% error rate. We measured it.