A harness can make a small model punch above its weight.

The model sets the ceiling. The harness decides how much of it you actually use.

May 19, 2026 · 6 min read

The honest framing

A reasonable claim about agent harnesses is this: a well-designed loop shifts the effective performance frontier of a given model on a given task. A less reasonable claim, which you will often hear, is that a 4B model with a harness can beat a 70B model without one. That kind of statement is task-dependent at best, marketing at worst. The literature does support the careful version. It does not support the loud version.

What follows is a tour of the techniques that show up in almost every modern agent harness, with the original sources. Read them if you want the specifics. The summaries below are deliberately conservative.

ReAct: interleave thinking and doing

Yao et al. 2023 proposed letting the model alternate between reasoning steps and actions in an external environment, rather than reasoning to completion and then acting, or acting with no explicit reasoning at all. On the question-answering and decision-making tasks they evaluated, the interleaved approach tended to outperform either pure chain-of-thought reasoning or action-only baselines.

The mechanism is unglamorous and important: when the model can observe a real result mid-trajectory, it can correct a wrong assumption before it compounds. Without that feedback, an early mistake propagates through every later step.
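A minimal sketch of that loop may make the feedback step concrete. Everything named here is an assumption for illustration: `scripted_model` stands in for a real LLM call, `TOOLS` is a hypothetical one-entry tool registry, and the `Thought` / `Action` / `Observation` transcript format is just one plausible layout, not the paper's exact prompt.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical tool registry: tool name -> function taking one string argument.
TOOLS: Dict[str, Callable[[str], str]] = {
    "lookup": lambda key: {"capital_of_france": "Paris"}.get(key, "unknown"),
}

OBS_PREFIX = "Observation: "


def scripted_model(transcript: List[str]) -> Tuple[str, str, str]:
    """Stand-in for an LLM call; returns (thought, action, argument).

    A real harness would send the transcript to a model and parse its reply.
    """
    observations = [line for line in transcript if line.startswith(OBS_PREFIX)]
    if not observations:
        # Nothing observed yet: act in the environment instead of guessing.
        return ("I should look this up rather than guess.", "lookup", "capital_of_france")
    # An observation is available: use it to produce the final answer.
    answer = observations[-1][len(OBS_PREFIX):]
    return ("The observation answers the question.", "finish", answer)


def react_loop(
    question: str,
    model: Callable[[List[str]], Tuple[str, str, str]] = scripted_model,
    max_steps: int = 5,
) -> Tuple[str, List[str]]:
    """Alternate thought -> action -> observation until the model finishes."""
    transcript = [f"Question: {question}"]
    for _ in range(max_steps):
        thought, action, arg = model(transcript)
        transcript.append(f"Thought: {thought}")
        transcript.append(f"Action: {action}[{arg}]")
        if action == "finish":
            return arg, transcript
        tool = TOOLS.get(action)
        observation = tool(arg) if tool else f"unknown tool {action!r}"
        # The real result goes back into context before the next thought.
        transcript.append(f"{OBS_PREFIX}{observation}")
    return "", transcript  # step budget exhausted without an answer
```

Calling `react_loop("What is the capital of France?")` returns the answer together with the full transcript, so the observation that drove the final step stays visible for inspection.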

Reflexion: verbal self-critique as a substitute for gradient updates

Most reinforcement learning needs weight updates. Shinn et al. 2023 introduced Reflexion, a method where the agent reflects in natural language on its own failed attempt, stores that reflection in memory, and tries again. No gradients move; only the text in the context window changes.

On the tasks the paper evaluates, this self-feedback loop improves success rates after several iterations. The gains are not free — every retry costs more inference — but for tasks where you can detect failure cheaply, that trade-off can be very favourable.
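The retry loop can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's implementation: `toy_actor`, `toy_evaluator` and `toy_reflector` are hypothetical stand-ins for the model, a cheap failure check, and the self-critique prompt. The only thing that changes between trials is the list of reflections passed back in as text.

```python
from typing import Callable, List, Tuple


def toy_actor(task: str, reflections: List[str]) -> str:
    """Stand-in for the LLM policy: its output depends only on the text it sees."""
    if any("sort" in note for note in reflections):
        return "sorted"
    return "unsorted"


def toy_evaluator(attempt: str) -> bool:
    """Stand-in for a cheap failure detector, such as a unit test."""
    return attempt == "sorted"


def toy_reflector(task: str, attempt: str) -> str:
    """Stand-in for the model critiquing its own failed attempt in words."""
    return f"Attempt {attempt!r} failed the check; next time sort the list before returning it."


def reflexion_loop(
    task: str,
    actor: Callable[[str, List[str]], str],
    evaluator: Callable[[str], bool],
    reflector: Callable[[str, str], str],
    max_trials: int = 3,
) -> Tuple[str, int, List[str]]:
    """Retry a task, carrying verbal reflections forward instead of weight updates."""
    memory: List[str] = []  # reflections accumulated across trials
    attempt = ""
    for trial in range(1, max_trials + 1):
        attempt = actor(task, memory)
        if evaluator(attempt):
            return attempt, trial, memory
        # Failure detected: store a natural-language lesson for the next trial.
        memory.append(reflector(task, attempt))
    return attempt, max_trials, memory
```

With these stand-ins, the first trial fails, one reflection is stored, and the second trial succeeds. The cost side of the trade-off is visible in the trial count: each extra trial is at least one more actor call plus one reflection call.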

Tree of Thoughts: search instead of single-shot

A standard LLM completion is a single trajectory through a tree of possible reasoning paths. Yao et al. 2023 (a different paper with the same first author, also published that year) proposed Tree of Thoughts: expand multiple candidate thoughts at each step, evaluate them, and search rather than commit. The harness becomes a search procedure with the model as both proposer and evaluator.

On problems with a clear notion of partial correctness — puzzles, planning, structured games — this search-based approach tends to outperform straight chain-of-thought prompting. On open-ended tasks the benefit is murkier, because evaluating partial answers is itself a research problem.
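A toy version of the search shows why committing early can fail where a wider frontier succeeds. The puzzle, the `propose` and `score` functions, and the beam-style breadth-first search are all assumptions made for illustration. In a real harness both functions would be model calls, and the paper explores more than one search strategy.

```python
from typing import List, Optional, Tuple

# A state is (current value, steps taken so far).
State = Tuple[int, Tuple[str, ...]]


def propose(state: State) -> List[State]:
    """Stand-in for the model proposing candidate next thoughts."""
    value, steps = state
    return [
        (value + 3, steps + ("+3",)),
        (value * 2, steps + ("*2",)),
    ]


def score(state: State, target: int) -> float:
    """Stand-in for the model rating a partial solution (higher is better)."""
    return -abs(target - state[0])


def tree_of_thoughts(
    start: int, target: int, beam_width: int = 2, max_depth: int = 4
) -> Optional[Tuple[str, ...]]:
    """Breadth-first search keeping the best `beam_width` states at each depth."""
    frontier: List[State] = [(start, ())]
    for _ in range(max_depth):
        candidates = [child for state in frontier for child in propose(state)]
        for value, steps in candidates:
            if value == target:
                return steps
        # Keep only the most promising partial solutions for the next round.
        candidates.sort(key=lambda s: score(s, target), reverse=True)
        frontier = candidates[:beam_width]
    return None  # no path found within the depth budget


print(tree_of_thoughts(1, 14, beam_width=2))  # ('+3', '+3', '*2')
print(tree_of_thoughts(1, 14, beam_width=1))  # None
```

With `beam_width=1` the search behaves like a single greedy trajectory: it follows the locally best step to 8, then 16, and never recovers. With `beam_width=2` the second-best branch survives long enough to reach 14. The example only works because `score` is a meaningful measure of partial progress, which is the condition the paragraph above describes.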

Self-Refine: the model edits its own draft

Madaan et al. 2023 describe Self-Refine, where a single model generates a draft, critiques it, and rewrites it in successive passes. No external tools, no extra training, just the same model in three different roles. The paper reports improvements across a range of generation tasks compared to the model's first-pass output.

The effect is intuitive once you have watched a human work for long enough. The first draft is rarely the best draft. Iteration helps. It also costs more, and for short factual answers the marginal value drops fast, which is another reason to apply this technique selectively rather than always.
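The generate, critique, rewrite cycle can be sketched as a short loop. The three `toy_*` functions are hypothetical stand-ins: in Self-Refine they would be three prompts to the same model, and the stopping rule here (the critique returns `None`) is an assumption for illustration.

```python
from typing import Callable, List, Optional, Tuple


def toy_generate(prompt: str) -> str:
    """Stand-in for the model's first-pass output."""
    return "the harness is leverage"


def toy_critique(draft: str) -> Optional[str]:
    """Stand-in for the model critiquing a draft; None means no complaints."""
    if not draft[0].isupper():
        return "capitalize the first letter"
    if not draft.endswith("."):
        return "end with a period"
    return None


def toy_refine(draft: str, feedback: str) -> str:
    """Stand-in for the model rewriting the draft in light of the feedback."""
    if "capitalize" in feedback:
        return draft[0].upper() + draft[1:]
    if "period" in feedback:
        return draft + "."
    return draft


def self_refine(
    prompt: str,
    generate: Callable[[str], str],
    critique: Callable[[str], Optional[str]],
    refine: Callable[[str, str], str],
    max_passes: int = 4,
) -> Tuple[str, List[str]]:
    """Draft once, then alternate critique and rewrite until clean or out of budget."""
    draft = generate(prompt)
    history = [draft]
    for _ in range(max_passes):
        feedback = critique(draft)
        if feedback is None:
            break  # nothing left to fix, so stop spending inference
        draft = refine(draft, feedback)
        history.append(draft)
    return draft, history
```

Here the draft passes through two rewrites before the critique comes back empty. The `max_passes` cap is where a harness would enforce the selectivity argued for above: a short factual answer might get a budget of zero.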

Toolformer: knowing when to call the calculator

An LLM that knows arithmetic is interesting. An LLM that knows it should not do arithmetic in its head, and reaches for a calculator instead, is useful. Schick et al. 2023 introduced Toolformer, which trains a model to decide when to call an external API mid-generation. The training data is itself constructed by the model, by inserting candidate tool calls and keeping the ones whose results reduce the model's loss on the text that follows.

This is technically a fine-tuning result rather than a pure harness result, but it points at the same conclusion: combining a model with the right external capabilities at the right moments outperforms asking the model to do everything itself.
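The filtering idea behind Toolformer's training data can be illustrated with a toy loss. This is a sketch under loud assumptions: `toy_loss` is a crude word-overlap proxy rather than a language-model loss, the bracketed call format and the `threshold` value are invented for the example, and the baseline (the better of "no call" and "call without its result") reflects my reading of the paper's criterion.

```python
from typing import Callable, List, NamedTuple


class Candidate(NamedTuple):
    call: str    # e.g. "Calculator(400/1400)"
    result: str  # what the tool returned


def toy_loss(prefix: str, continuation: str) -> float:
    """Crude stand-in for LM loss: share of continuation words not seen in the prefix."""
    seen = set(prefix.split())
    tokens = continuation.split()
    return sum(tok not in seen for tok in tokens) / len(tokens)


def keep_useful_calls(
    prefix: str,
    continuation: str,
    candidates: List[Candidate],
    loss: Callable[[str, str], float] = toy_loss,
    threshold: float = 0.1,
) -> List[Candidate]:
    """Keep candidate tool calls whose result makes the continuation easier to predict."""
    loss_plain = loss(prefix, continuation)
    kept = []
    for cand in candidates:
        loss_call_only = loss(f"{prefix} [ {cand.call} ]", continuation)
        loss_with_result = loss(f"{prefix} [ {cand.call} -> {cand.result} ]", continuation)
        # Compare against the better of "no call" and "call without its result".
        baseline = min(loss_plain, loss_call_only)
        if baseline - loss_with_result >= threshold:
            kept.append(cand)
    return kept


prefix = "Out of 1400 participants, 400 passed, which is"
continuation = "0.29 of the group"
candidates = [
    Candidate("Calculator(400/1400)", "0.29"),
    Candidate("Calendar()", "Tuesday"),
]
print([c.call for c in keep_useful_calls(prefix, continuation, candidates)])
# ['Calculator(400/1400)']
```

The calculator call survives because its result appears in the text that follows; the calendar call is dropped because it does nothing for the continuation. In the actual method the surviving examples become fine-tuning data, which is the step this sketch leaves out.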

What this does and does not mean

Taken together, these techniques describe a real, repeatable pattern: structured prompting, search, reflection, and tool use each tend to improve task performance, and they compose. The combined improvement on a well-suited task can be substantial. None of this is the same as saying any small model with any harness beats any large model.

The improvements are task-dependent. They cost extra inference, sometimes a lot of it. They can introduce new failure modes — a confidently wrong reflection, a tool call to a stale API, a search that finds a plausible but incorrect leaf. The harness is leverage. Leverage cuts both ways. Used well, it moves the frontier; used carelessly, it adds latency and bills with no improvement to show for them.

References

Yao, S., et al. 2023. ReAct: Synergizing Reasoning and Acting in Language Models. arXiv:2210.03629. https://arxiv.org/abs/2210.03629
Shinn, N., et al. 2023. Reflexion: Language Agents with Verbal Reinforcement Learning. arXiv:2303.11366. https://arxiv.org/abs/2303.11366
Yao, S., et al. 2023. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. arXiv:2305.10601. https://arxiv.org/abs/2305.10601
Madaan, A., et al. 2023. Self-Refine: Iterative Refinement with Self-Feedback. arXiv:2303.17651. https://arxiv.org/abs/2303.17651
Schick, T., et al. 2023. Toolformer: Language Models Can Teach Themselves to Use Tools. arXiv:2302.04761. https://arxiv.org/abs/2302.04761