Why local models benefit most from a good harness.

A small model on your device, wrapped in a clever loop, is the most private kind of leverage you can get.

May 19, 2026 · 6 min read

The constraint that shapes everything

An on-device model lives inside a budget. RAM, thermal headroom, battery, storage. You cannot just scale parameters until the model feels smarter. The hardware will not let you. The room to grow capability has to come from somewhere else.

An agent harness is one of the few places it can come from cheaply. Adding tool calls, a reflection step, or a small planning module does not change the model. It does not require a bigger checkpoint or a different SoC. It just runs the same model more times, in a more useful pattern. For local AI, that property is the whole game.

The harness lives where the model lives

Most harness compute is just more model inference. If the model is local, the harness is local. Tool calls can be local too — a calculator, a file search, a database query against your own data. Schick et al. 2023 showed that a model can learn when to invoke external tools to improve its outputs; nothing about that pattern requires the tools to be in the cloud.

This is what privacy-preserving leverage actually looks like. The conversation, the documents, the intermediate reasoning, the reflections — all of it stays on the device. There is no round trip for the agent loop because the agent loop is running on your CPU, GPU, or NPU. Airplane mode does not break it.

Memory and persona without sending your life to a server

Park et al. 2023, in the Generative Agents paper, used a memory stream — observations, reflections, plans — to give simulated characters continuity over time. The memory mechanism is independent of the model size; it is structure around the model.

A local assistant can use the same pattern. The "memory" is a file on the device. The reflections are produced by the same local model running over the day's observations. Nothing about this needs to travel to a server. A long-running, continuously-improving local agent is, in principle, a small model plus a well-organised notebook plus a loop.

Multiple agents, all of them local

Wu et al. 2023, with AutoGen, frame agent systems as conversations between specialised roles — a planner, a coder, a critic — coordinated by the framework. Cloud deployments make each role a different model. On a device, each role can be the same local model with a different system prompt.

This is cheaper than it sounds. The weights are loaded once. Switching roles is a context switch, not a model swap. A small local model can play three or four parts in a single trajectory without ever talking to a network. The "team" lives on one chip.

Open-ended learning, locally

Wang et al. 2023, in Voyager, showed an agent that builds and reuses a skill library over time as it explores an environment. The skills are code. The library grows as the agent succeeds. The model itself is fixed; the agent gets more capable because its toolbox gets bigger.

For a local assistant, this maps cleanly onto a skills folder on the device. The model writes a small routine, tests it, keeps it if it works, and reaches for it next time. No retraining, no upload, no shared model state with anyone else's device. The agent learns; your data stays.

The honest trade-off: more inference, more heat

None of this is free. A harness uses more total inference than a single prompt does, often a lot more. Reflection steps, planning steps, retries, tool-call decisions — each is another forward pass. On a desktop with a discrete GPU this is fine. On a phone it matters.

More inference means more battery drain, more heat, and on sustained loads the system will thermally throttle and slow down. A harness that calls the model fifteen times for a question that a careful prompt could answer in one call is not making the experience better; it is just turning the phone into a hand warmer. Good local harnesses are restrained: reflect when the answer is uncertain, plan when the task has multiple steps, retry when you can detect a real failure. Otherwise, stop.

The shape of useful local AI

Put the pieces together and the picture is clear. A modest local model, a thoughtful harness, local tools, local memory, and discipline about when to spin the loop. That combination keeps data on the device, scales capability through structure rather than parameters, and respects the hardware. It is the version of AI that does not need to phone home to be useful.

References

Schick, T., et al. 2023. Toolformer: Language Models Can Teach Themselves to Use Tools. arxiv:2302.04761. https://arxiv.org/abs/2302.04761
Park, J. S., et al. 2023. Generative Agents: Interactive Simulacra of Human Behavior. arxiv:2304.03442. https://arxiv.org/abs/2304.03442
Wu, Q., et al. 2023. AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation. arxiv:2308.08155. https://arxiv.org/abs/2308.08155
Wang, G., et al. 2023. Voyager: An Open-Ended Embodied Agent with Large Language Models. arxiv:2305.16291. https://arxiv.org/abs/2305.16291