The question that drove this work was straightforward. If you gave a 3-billion parameter open-weights model a well-curated research substrate and a decent retrieval layer, could it answer expert-level questions about agent engineering?

The short answer turned out to be: yes on the easy questions, barely on the hard ones, and not in the way we expected.

The long answer is worth writing down because it changed how we think about what small local models need to become useful in production, and it reframed a lot of the work we had planned for the next quarter.

What a Harness Actually Is

The word "harness" gets used loosely, so it is worth pinning down. When we say harness we mean the scaffolding that turns a language model into a goal-pursuing agent. The model on its own is a text predictor. The harness is everything around the model that lets it remember, act on the world, check its own work, and recover when something goes wrong.

We broke it into twelve subtopics and treated each as a cluster of knowledge slots that a serious agent engineer would need to understand. Memory — short-term context, episodic retrieval, rolling summaries, hierarchical stores. Tool use — function calling, parallel invocations, argument repair, result verification. Planning — ReAct, plan-and-execute, tree of thoughts, replanning on failure. Reflection — Reflexion, self-refine, chain of verification, critique-and-revise. Routing — skill dispatch, model cascades, cost-aware selection, fallback ladders. Verification — output schema validation, constrained decoding, grounded generation, contradiction detection. And six more on prompt engineering, error recovery, sampling, state persistence, dialogue, and cost control.

Twelve subtopics at roughly ten slots each give you a 120-item map of what "being good at agent harnesses" means in 2026. That map became the target for the research pipeline.

Twelve harness subtopics, each a cluster of knowledge slots the model has to cover.

The Pipeline: Four Phases, Built to Measure

We built the research pipeline as a four-phase loop so that every step was measurable and the result would be falsifiable. Inventory, ingest, distill, test.

The inventory phase scored the 120 target slots against twelve existing knowledge graphs we already maintain — general agent literature, AI/ML fundamentals, reasoning, mechanistic interpretability, and so on. That gave us a gap map. 55 percent of the slots already had strong coverage from prior work. 9 percent had weak coverage. 35 percent were entirely uncovered. That gap list became the ingestion target.
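The gap-map step can be sketched as a simple bucketing pass. This is a hypothetical illustration, not the actual pipeline code: the slot names, the 0-to-1 coverage scores, and the thresholds are all invented for the example.

```python
# Hypothetical sketch of the inventory phase: score each target slot
# against existing coverage, then bucket it. Thresholds are illustrative.
def classify_slots(slot_scores, strong=0.7, weak=0.3):
    """Bucket each target slot by its coverage score from existing graphs."""
    gap_map = {"strong": [], "weak": [], "uncovered": []}
    for slot, score in slot_scores.items():
        if score >= strong:
            gap_map["strong"].append(slot)
        elif score >= weak:
            gap_map["weak"].append(slot)
        else:
            gap_map["uncovered"].append(slot)
    return gap_map

scores = {"memory/episodic-retrieval": 0.9,
          "routing/model-cascades": 0.4,
          "verification/constrained-decoding": 0.1}
gaps = classify_slots(scores)
# Weak and uncovered slots become the ingestion target.
ingest_targets = gaps["weak"] + gaps["uncovered"]
```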

The ingest phase fired arxiv queries at every uncovered and weak slot, then did a second pass targeted at named methods like ReAct, Reflexion, Chain of Verification, Toolformer, DSPy, LangGraph, AutoGen, and AgentBench. When we were done the library held 591 curated papers across the twelve subtopics, deduplicated by arxiv ID and organized by slot.
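Deduplicating by arXiv ID and organizing by slot is a small amount of code; a minimal sketch, assuming each paper record carries an `arxiv_id` and a `slot` field (field names are assumptions, not the actual schema):

```python
from collections import defaultdict

def organize_library(papers):
    """Keep the first occurrence of each arXiv ID, then bucket by slot."""
    seen = set()
    by_slot = defaultdict(list)
    for paper in papers:
        if paper["arxiv_id"] in seen:
            continue  # duplicate hit from an overlapping query
        seen.add(paper["arxiv_id"])
        by_slot[paper["slot"]].append(paper)
    return dict(by_slot)
```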

The distill phase ran a deterministic pattern-based extractor over the 591 papers and built a dedicated knowledge graph — 91 nodes across Technique, Tool, Concept, Metric, Problem, and Architecture types, with 228 co-occurrence edges recording which techniques tend to appear together in the literature. That graph was our substrate. The hypothesis under test was that a small local model, asked a hard harness question, would do meaningfully better with access to this graph than without it.
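The co-occurrence edges described above can be sketched as pair counting over the entities extracted from each paper. The entity lists below are illustrative, not drawn from the actual graph:

```python
from collections import Counter
from itertools import combinations

def build_cooccurrence_edges(paper_entities):
    """Count how often each pair of extracted nodes appears in the same
    paper; each pair becomes a weighted co-occurrence edge."""
    edge_weights = Counter()
    for entities in paper_entities:
        # Sort so (a, b) and (b, a) collapse to one edge key.
        for a, b in combinations(sorted(set(entities)), 2):
            edge_weights[(a, b)] += 1
    return edge_weights

papers = [["ReAct", "Reflexion", "self-refine"],
          ["ReAct", "tool-use"],
          ["Reflexion", "self-refine"]]
edges = build_cooccurrence_edges(papers)
```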

The test phase was where the real work happened. We generated 34 questions — 24 covering the base subtopics, 6 designed as composition-hard questions that required crossing multiple subtopics, and 4 deliberately out of domain as a control group. Every question got two answers from the same 3-billion parameter local model: one closed-book, one with the top five knowledge graph nodes injected into the context. Then a separate, stronger judge graded all the answers on an adversarial rubric covering accuracy, specificity, tradeoffs, and terminology.
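The paired comparison has a simple shape. In this hedged sketch, `generate`, `retrieve_top_nodes`, and `judge` are stand-ins for the local model, the retrieval layer, and the stronger grading model; the prompt format is an assumption:

```python
# Sketch of one closed-book / open-book pair. The callables are
# placeholders for the real model, retriever, and judge.
def run_pair(question, generate, retrieve_top_nodes, judge, k=5):
    closed = generate(question)
    nodes = retrieve_top_nodes(question, k=k)
    context = "\n".join(n["summary"] for n in nodes)
    open_book = generate(f"Reference notes:\n{context}\n\nQuestion: {question}")
    return {"closed": judge(question, closed),
            "open": judge(question, open_book)}
```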

If you do not include a control group in a retrieval eval, you cannot tell whether the substrate is helping or whether the grader is rewarding the model for name-dropping.

The Result: Measurable, Modest, and Almost Swallowed by Grader Bias

The naive numbers looked reasonable. Closed-book average score on the base questions was 24.1 out of 40. Open-book average, with the graph nodes retrieved, was 27.3. That is a gain of 3.2 points, which is not nothing. Hard composition questions showed a larger gain: 18.5 closed-book against 24.3 open-book, a delta of 5.8 points. The open-book variant won every comparison.

We would have called that a success and moved on, except the control group told a different story. On the four questions deliberately outside the harness domain — Python internals, PostgreSQL MVCC, TCP versus UDP, the bias-variance tradeoff — the open-book answers scored 4.3 points higher than the closed-book answers on average. Those should have been a tie. If the graph cannot help on a Python question, the graph cannot help on a Python question. A +4.3 delta on the control set meant the judge was rewarding the model for name-dropping techniques from the graph whether or not those techniques were relevant to the question.

Subtracting the control bias flipped the base-question result from +3.2 to −1.1. The hard-question result dropped from +5.8 to +1.5. The substrate had a real, positive effect on composition questions, but it was substantially smaller than the headline numbers suggested, and on easy questions the substrate was slightly hurting performance by inducing the model to paste irrelevant terminology into its answers.
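The correction itself is one subtraction per question group, using the averages reported above:

```python
def corrected_delta(open_avg, closed_avg, control_bias):
    """Subtract the control-group bias from the raw open-vs-closed gain."""
    return round((open_avg - closed_avg) - control_bias, 1)

base = corrected_delta(27.3, 24.1, 4.3)   # base questions: raw +3.2 -> -1.1
hard = corrected_delta(24.3, 18.5, 4.3)   # hard questions: raw +5.8 -> +1.5
```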

The raw gains shrank dramatically after correcting for a +4.3 control-group bias. The substrate is still useful on composition tasks, but the effect size is nowhere near the headline.

One question failed in a way worth calling out specifically. The question was about Reflexion, a well-known reflection pattern in the agent literature. The closed-book answer was workmanlike — not great, but correct in outline. The open-book answer confidently described Reflexion as an attack technique, because the graph had an ambiguous node where "reflexion" appeared in both a reflection-pattern context and an adversarial-agent context. The retrieval layer picked the wrong one, and the model repeated it with full confidence. On a rubric that rewards specificity, a confident wrong answer can score higher than a hedged right one.
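One plausible mitigation for that collision is to tag every graph node with its source context and filter retrieval by the question's domain. This is a sketch of the idea, not the pipeline's fix; the node fields and context labels are invented for illustration:

```python
# Illustrative only: disambiguating same-named nodes by context tag.
NODES = [
    {"name": "Reflexion", "context": "reflection-pattern",
     "summary": "Verbal self-feedback carried across task episodes."},
    {"name": "Reflexion", "context": "adversarial-agent",
     "summary": "Unrelated usage from a different subliterature."},
]

def retrieve(term, domain):
    """Return only nodes whose context tag matches the question's domain."""
    return [n for n in NODES
            if n["name"].lower() == term.lower() and n["context"] == domain]
```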

What the Failure Mode Actually Teaches

If you stop at the numbers, the honest read is that keyword retrieval over a curated knowledge graph does not turn a 3-billion parameter model into an expert. It makes the model sound more expert, which is not the same thing.

But the more interesting finding is structural. The places where the substrate helped most were precisely the composition-hard questions — the ones where the model had to combine memory and tool use, or reflect-then-route, or design a full harness for a specific constraint. Those are the questions that require cross-slot synthesis, and those are the questions where the base model genuinely benefits from having a dense set of named patterns to anchor on. The places where the substrate hurt were the easy, single-slot questions where the model already knew the answer and retrieval just added noise.

That is actually a strong signal. It says the harness substrate behaves like a specialist reference library: useful when the problem is big enough to need it, harmful when the problem is small enough to drown in it. That is the opposite of how vendors usually sell retrieval augmentation, which is as a uniform uplift across every query.

The right question is not whether retrieval helps a small model. It is whether retrieval helps on the specific class of question you actually need answered. For us, that class is composition over a broad harness vocabulary — and there, the substrate is clearly doing work.

Why the Lever Is the Harness, Not the Model

Pull back from the eval for a moment and look at what the last six months of agent engineering have actually produced. The frontier models got better. The open models got better. But the delta between a state-of-the-art agent system and a naive one is not explained by which model is in the loop. It is almost entirely explained by what the harness around the model looks like.

Two agents using the same model can differ by an order of magnitude in reliability depending on whether the harness includes retry-with-feedback, whether tool arguments get repaired on parse failure, whether the agent asks for clarification instead of guessing, whether there is a verification pass on structured outputs, whether the planner can replan when a step fails. None of those are model capabilities. They are harness capabilities. They are things you build around the model.
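One of those harness capabilities, retry-with-feedback on tool-argument failure, fits in a few lines. A minimal sketch, assuming the model exposes a `generate_args` callable that accepts repair feedback; the function names and retry budget are assumptions:

```python
import json

# Harness capability sketch: when the model emits tool arguments that
# fail to parse or validate, feed the error back and retry instead of
# letting the whole step fail.
def call_tool_with_repair(generate_args, tool, validate, max_retries=2):
    feedback = ""
    for _ in range(max_retries + 1):
        raw = generate_args(feedback)
        try:
            args = json.loads(raw)      # parse failure -> JSONDecodeError
            validate(args)              # schema failure -> ValueError
            return tool(**args)
        except (json.JSONDecodeError, ValueError) as err:
            feedback = f"Previous arguments were rejected: {err}. Fix and retry."
    raise RuntimeError("tool call failed after repair attempts")
```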

The model is the small circle. The harness is everything that lets it reach.

This is why small local models matter more than the benchmark-chasing narrative makes it sound. A 3-billion parameter model with a serious harness around it can do real production work on narrow domains. A frontier model with a lazy harness — prompt plus function calls, no verification, no retry, no routing — will fail in exactly the same ways people have been complaining about for two years. The harness is doing the load-bearing work.

What We Are Changing Based on This

Three things shifted in our engineering plan once the numbers came in.

First, the retrieval substrate is still valuable, but it is a tool for synthesis questions, not a universal uplift. We are wiring it into the specific agents that do design and composition work, not into every agent blindly. Easy, single-slot questions route around it because the base model already handles them cleanly and retrieval just adds bias.
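That routing decision can be sketched as a gate in front of retrieval. The keyword classifier below is a deliberately crude stand-in; a real router might be a small model or a richer heuristic, and the marker list is invented:

```python
# Sketch: only composition-style questions pay the retrieval cost;
# easy single-slot questions go closed-book.
COMPOSITION_MARKERS = ("design", "combine", "end-to-end", "architecture")

def should_retrieve(question):
    q = question.lower()
    return any(marker in q for marker in COMPOSITION_MARKERS)

def answer(question, generate, retrieve_context):
    if should_retrieve(question):
        return generate(question, context=retrieve_context(question))
    return generate(question, context=None)  # closed-book path
```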

Second, the real lever for a 3-billion parameter model becoming an expert in a domain is fine-tuning on distilled question and answer pairs derived from the substrate, not runtime retrieval over it. The substrate we built is now the seed corpus for a supervised fine-tuning run. That is the move the literature has been pointing at for a year, and the eval confirmed it: retrieval makes a model sound expert, training makes it one.

Third, any future retrieval eval we run will include a control group by default. The v1 of this eval was graded by the same model that generated the answers, and it produced beautiful-looking numbers that completely inverted once we added four non-harness control questions and a stronger judge. That is a process lesson we will not forget. Self-grading evals on retrieval systems are essentially worthless because the failure mode — the model rewarding itself for repeating retrieved text — is invisible from inside.

What to Take Away

If you are building agent systems right now, the useful observations from this work are:

The harness is the thing that matters once you drop below frontier scale. Memory design, tool-call repair, output verification, routing — those are the decisions that separate a working system from a demo. The model is a commodity.

Retrieval-augmented substrates work when the question is composition-heavy and hurt when the question is narrow and already covered by the base model. Route accordingly.

If you are evaluating whether a knowledge substrate helps your agent, use an out-of-domain control group. If the control group scores the same on both conditions, your main-group numbers are real. If it does not, your numbers are partly or entirely grader bias.

And finally: the honest negative result is the point. We built the pipeline expecting a clean positive result, got a messy one, and that messy one is what tells you where to invest next. If you are running AI research and your results are always clean, you are probably not measuring hard enough.

Building agent systems for a real program?

We work with teams who need production agents on hardware they control — not demos on someone else's API. Walk us through your constraints and we will show you what the harness has to look like to meet them.

Start a Conversation