Harness Engineering Essentials: Nine Practices for Taming AI Coding Agents

2026/04/12

In late 2025, Anthropic published Effective Harnesses for Long-Running Agents, introducing the concept of a harness — using architectural means to constrain an unstable model into a sustainably running engineering system. They later expanded on multi-agent division of labor and evaluation separation in Harness Design for Long-Running Application Development.

The book Harness Engineering: Claude Code Design Guide takes this idea further by analyzing Claude Code's source code to systematically break down the engineering design across prompt control planes, execution loops, tool permissions, context governance, error recovery, multi-agent verification, and team adoption. The book contains extensive source-level analysis of Claude Code internals — this article strips that away and retains only the transferable best practices.

1. Why Harness Engineering

  • Models don't deserve inherent trust. A talking probability distribution, once given access to terminals and files, escalates risk from the rhetorical level to the execution level. A text-only model that makes mistakes merely adds communication overhead, but a tool-wielding model that makes mistakes leaves real consequences — deleted files, killed processes, rewritten Git history. Agent systems entering real engineering environments must first acknowledge that their core component is unstable. Ignore this, and the problems will eventually surface in logs and incident reports.
  • Constrained execution is the core capability. Models make mistakes, tools amplify their consequences, context bloats, state pollutes the next turn, users interrupt, and failures recur. A system cannot maintain order through "cleverness" alone — only through structure. Structure isn't as flashy as cleverness, but it's usually more reliable. A truly usable agent system cannot rely on a single "magic prompt" to solve everything; it must decompose control into layers, and layers into responsibilities.
  • A harness is an entire control plane. It encompasses prompt constraints, execution loops, tool scheduling, permission approval, error recovery, and more — all converging on one goal: making the model produce tolerable behavior despite being unreliable. High-risk capabilities demand high-density constraints — the more powerful the capability, the finer the control, because the real world won't automatically forgive an erroneous execution just because the model sounded confident. Taken together, Harness Engineering isn't mysterious; it simply insists on a few commonly overlooked engineering fundamentals.

2. Prompts Are a Control Plane, Not a Persona

  • Persona and control operate on different layers. Persona descriptions address "what it resembles"; a control plane addresses "what it can do, when it should act, what happens when it fails, and who backstops it." A system can have a likable persona while completely lacking discipline at the execution level — such systems tend to seem very sincere when things go wrong, because they're great at apologizing, but apologies cannot substitute for runtime design. For an agent that reads files, invokes tools, touches the shell, handles permissions, and executes across turns, the prompt is closer to a runtime protocol than a character biography.
  • Prompts should be assembled in layers, not written as a single monolithic text. Mature systems don't pin their faith on a single version of a prompt but treat it as a hierarchical configuration system: identity declarations, system-level rules, engineering constraints, and domain behaviors each managed independently. When newly added reminders and prohibitions conflict with each other, system behavior becomes unpredictable — decomposing into layers and responsibilities is the proper approach. A model that automatically "optimizes everything it touches" may seem enthusiastic from a product perspective, but it's quite dangerous from an engineering perspective, which is why engineering constraints (don't overstep scope, don't conceal verification failures, don't create unnecessary abstractions) must be explicitly specified in the prompt.
  • Prompts must have a priority mechanism. Different contexts (coordinator mode, agent mode, user overrides) should have an explicit priority ordering, rather than last-write-wins. The key principle: new agent instructions can only layer domain behavior on top of default constraints — they cannot replace the entire discipline. Think of it as general regulations plus a job description — the job description can add responsibilities but cannot override the foundational regulations, or the system will quickly devolve into anarchy. Customizability without structure ultimately degenerates into just another form of chaos.
  • Prompts must also connect to the memory system. A mature prompt doesn't just prescribe "how to execute this turn" — it also prescribes "how long-term memory should be formed": what to save, what not to save, how to separate indexes from content, and how plans and tasks shouldn't be misused as memory. Once you reach this point, the prompt can no longer be merely a matter of tone — it necessarily enters the domain of institutional design. It extends the prompt's responsibility from "constraining current behavior" to "constraining how future knowledge is accumulated," making it closer to a knowledge governance protocol for runtime participants.
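The layering and priority rules above can be sketched in code. This is a minimal illustration with hypothetical layer names and a made-up `assemble_prompt` helper; it is not Claude Code's actual mechanism, only the shape of the idea:

```python
# Sketch: layered prompt assembly with a fixed priority order.
# All names here are illustrative assumptions, not a real harness API.
from dataclasses import dataclass

@dataclass(frozen=True)
class PromptLayer:
    name: str
    priority: int      # lower = more foundational
    content: str
    overridable: bool  # may higher layers replace this text?

def assemble_prompt(layers, agent_overrides=None):
    """Merge layers in priority order. Agent overrides may only ADD domain
    behavior on top of non-overridable base constraints, never replace them."""
    agent_overrides = agent_overrides or {}
    parts = []
    for layer in sorted(layers, key=lambda l: l.priority):
        text = layer.content
        if layer.name in agent_overrides:
            if layer.overridable:
                text = agent_overrides[layer.name]                 # replace
            else:
                text = text + "\n" + agent_overrides[layer.name]   # append only
        parts.append(f"## {layer.name}\n{text}")
    return "\n\n".join(parts)

layers = [
    PromptLayer("identity", 0, "You are a coding agent.", overridable=False),
    PromptLayer("engineering-constraints", 1,
                "Do not overstep task scope. Never conceal verification failures.",
                overridable=False),
    PromptLayer("domain-behavior", 2, "Default: general-purpose edits.",
                overridable=True),
]

prompt = assemble_prompt(layers, {
    "engineering-constraints": "Additionally: never force-push.",  # appended
    "domain-behavior": "You are operating in a Rust repository.",  # replaced
})
```

The design choice worth copying is the `overridable` flag: an agent definition can append to the foundational layers but can only fully replace the layers explicitly marked replaceable, so customization cannot erase the base discipline.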

3. The Query Loop: Heartbeat of an Agent System

  • Agents depend on stateful execution loops, not request-response. To gauge whether an agent is mature, first check if it maintains cross-turn execution state: message history, recovery counts, compaction tracking, tool context, turn counters, and more. Once designed this way, the system formally acknowledges that problems left over from the previous turn will enter the next, and the system must be capable of continuing to handle them. Whether a system deserves to be called an agent often depends not on whether it can speak, but on whether it still knows what it's doing after several turns. Scripts only care whether this step finished; agent systems must also care whether, after this step fails, the next step can pick up the state left behind.
  • Context governance precedes model inference. Before invoking the model, the runtime should complete a series of housekeeping tasks: extracting valid messages, trimming tool results, compressing history, and collapsing context. Many systems do the opposite: stuff massive context in first, then hope the model figures out what's important on its own. That approach seems efficient but is really offloading the runtime's responsibility onto a probability distribution. Don't hand the model the job of extracting order from chaos — have the runtime complete governance first, then pass cleaner input to the model. Tidy the scene before starting execution; this approach isn't glamorous, but it's usually more dependable.
  • Interruptions and recovery must be first-class semantics. Whenever a system has committed to an execution sequence, it must settle the books upon interruption — it cannot pretend previous actions never happened just because the user interrupted. Tool calls already dispatched but not yet completed must generate compensating results, ensuring the execution trace remains explainable. Recovery isn't simple retry but a layered attempt from lowest to highest cost and destructiveness. Stop conditions must also be differentiated: streaming completion, user interruption, prompt-too-long, output truncation, hook blocking — each taking a different path. Distinguishing "retry on failure" from "knowing when not to retry" is a hallmark of a mature system.
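The loop described above can be sketched as follows. Every name here (`LoopState`, `Stop`, the two callbacks) is an illustrative assumption, not any real harness's API; the point is the ordering: govern context first, infer second, settle the books on interruption:

```python
# Sketch: a stateful query loop with differentiated stop conditions and
# compensating results for interrupted tool calls.
from dataclasses import dataclass, field
from enum import Enum, auto

class Stop(Enum):
    COMPLETED = auto()
    USER_INTERRUPT = auto()
    PROMPT_TOO_LONG = auto()
    OUTPUT_TRUNCATED = auto()

@dataclass
class LoopState:
    messages: list = field(default_factory=list)
    turn: int = 0
    pending_tool_calls: list = field(default_factory=list)

def run_turn(state, model_step, govern_context):
    """One heartbeat. model_step(messages) -> (stop, output, tool_calls)."""
    state.turn += 1
    state.messages = govern_context(state.messages)  # runtime tidies the scene first
    stop, output, tool_calls = model_step(state.messages)
    state.pending_tool_calls.extend(tool_calls)
    if stop is Stop.USER_INTERRUPT:
        # Dispatched-but-unfinished calls get compensating results so the
        # execution trace stays explainable across turns.
        for call in state.pending_tool_calls:
            state.messages.append(
                {"role": "tool", "call": call, "result": "interrupted-by-user"})
        state.pending_tool_calls.clear()
    elif output:
        state.messages.append({"role": "assistant", "content": output})
    return stop
```

Note that each stop condition returns to the caller as a distinct value rather than a generic "done", so the surrounding runtime can route each one down a different path.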

4. Tools, Permissions, and Interrupts

  • Tools are managed execution interfaces, not extensions of model capability. Tools are not opinions — tools are actions. Actions leave consequences; consequences touch the real world. The model proposes actions; whether they proceed is decided by the runtime, rules, and the user. Permission outcomes should have at least three states: allow, deny, and ask. The third state, "ask", is crucial — it acknowledges that the system itself shouldn't make every decision on the user's behalf. Understanding intent does not equal having authorization, let alone having ongoing authorization. The system must separate "being capable" from "being permitted."
  • Tool scheduling must preserve causal order. Once a tool system allows concurrency, it must answer an old question: who determines context changes, and in what order do they take effect? The correct approach is that even when execution is concurrent, semantic context evolution maintains a deterministic order — cache modifications first, then replay in original order. Concurrency can improve throughput but must not break causal order. A tool system without scheduling discipline only amplifies the model's instability into the external world; unconstrained concurrency widens the blast radius.
  • High-risk tools must be treated differently. An interface like Bash, which is virtually unconstrained by domain boundaries, must be treated as a special case — it can directly touch files, processes, networks, and Git repositories, and carries complex shell semantics like redirection and piping. The correct approach is to build dedicated permission checks, command prefix parsing, subcommand count limits, and exhaustive operational rules specifically for it. High-risk capabilities should not receive the same treatment as general capabilities — the more general the capability, the more special oversight it needs. Treating Bash as an ordinary tool is usually just laziness in design.
  • The tool system protects not just the user, but also the system itself. Incomplete execution results, out-of-order context modifications, unbounded concurrent side effects, and unclear interruption semantics — these problems most quickly destroy system consistency. The purpose of constraining tools is to ensure that "what was executed, what wasn't completed, and why it stopped" always forms a traceable causal chain. Many constraints appear to prevent user errors on the surface, but at a deeper level they prevent the system itself from becoming an inexplicable heap of state fragments. Unexplainable execution traces inevitably become ops problems, audit problems, or long-term liabilities that nobody on the team can untangle.
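The three-state permission model and the special handling of Bash can be combined in one sketch. The rule table and function names below are illustrative assumptions, not any shipped policy:

```python
# Sketch: allow / deny / ask permission decisions, with dedicated
# prefix-based rules for a high-risk Bash tool.
DECISION_ALLOW, DECISION_DENY, DECISION_ASK = "allow", "deny", "ask"

# Hypothetical rule table: the more destructive or shared-state-touching
# the command, the stricter the decision.
BASH_PREFIX_RULES = {
    "git status": DECISION_ALLOW,
    "git log": DECISION_ALLOW,
    "rm -rf": DECISION_DENY,
    "git push": DECISION_ASK,   # touches shared remote state: ask the user
}

def check_permission(tool, argument):
    if tool != "bash":
        return DECISION_ALLOW          # ordinary tools: default policy applies
    # High-risk tool gets longest-prefix matching; anything unrecognized
    # falls back to "ask" rather than silently proceeding.
    for prefix in sorted(BASH_PREFIX_RULES, key=len, reverse=True):
        if argument.startswith(prefix):
            return BASH_PREFIX_RULES[prefix]
    return DECISION_ASK
```

A real implementation would also parse shell semantics (pipes, redirection, command chaining) before matching, since `git status && rm -rf /` must not inherit the permission of its prefix; the sketch omits that deliberately.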

5. Context Governance: Memory and Compaction

  • Context is a budget, not a warehouse. "More information makes the system smarter" is a common myth. Context is first and foremost an expensive, bloat-prone, self-contaminating resource — an agent system is not a library, and the model is not a librarian. Long-term rules, persistent memory, session continuity, and ephemeral conversation should be governed in separate layers, not mixed into one pot. Stable team conventions and repository constraints have lifespans far exceeding any single user message; if everything gets stuffed into chat history, you either redundantly inject every turn wasting context, or rely on the model to recall things on its own, which will eventually fail.
  • Memory entry points must stay small. Index files are naturally loaded frequently, and once frequently-loaded content grows fat, it slowly drags down the entire context system. Long-term memory should be split into "entry points" and "body": entry points for low-cost addressing, body for high-density content. Entry point files should have hard limits — exceed them and the system truncates with a warning, moving details to separate files. Once an entry point serves as both directory and content, it ends up being neither — just an abandoned summary nobody wants to read a second time.
  • Session continuity requires structured summaries, not chat logs. Session memory should be distilled into a continuation-ready operating manual: current state, pitfalls encountered, files modified, and what to pick up next. It doesn't aim to fully replicate the conversation but to compress the essential skeleton needed to keep working. Summary budgets must also be controlled, prioritizing "current state" and "error corrections" — the parts most useful for the next execution step. A truly mature system treats "preserving the most useful parts for continuing work" as a virtue, because context budget is working memory, and working memory's first duty is to be actionable.
  • Compaction's goal is rebuilding work semantics, not writing a nice summary. Post-compaction context must restore plan state, file state, skill constraints, tool attachments, and other runtime environment — the summary is merely an intermediate artifact, and the real goal is to lay out the work foundation needed to keep going. Compaction is therefore more like a controlled restart than a chat summary — old context gets translated into a new work foundation. Systems that only do the first half may "roughly remember" after compaction, but they've already lost tool and plan state, and will spend the next several turns finding themselves again. Context systems should prioritize preserving what maintains action semantics over what appears to have the most information density.
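The "entry points stay small" rule is mechanical enough to sketch. The line budget, filenames, and helper below are assumptions for illustration:

```python
# Sketch: enforce a hard size budget on a memory entry-point file.
# Overflow moves to a separate body file; the entry point keeps a warning.
ENTRY_POINT_MAX_LINES = 40  # illustrative budget

def enforce_entry_point_budget(entry_lines, overflow_path="memory/details.md"):
    """Return (kept_lines, spilled_lines). The kept list never exceeds the
    budget; the last kept line warns that details moved elsewhere."""
    if len(entry_lines) <= ENTRY_POINT_MAX_LINES:
        return entry_lines, []
    kept = entry_lines[:ENTRY_POINT_MAX_LINES - 1]
    spilled = entry_lines[ENTRY_POINT_MAX_LINES - 1:]
    kept.append(
        f"<!-- WARNING: entry point truncated; "
        f"{len(spilled)} lines moved to {overflow_path} -->")
    return kept, spilled
```

The split keeps the frequently loaded file cheap to address while the body file holds the high-density content, which is exactly the directory-versus-content separation argued for above.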

6. Errors and Recovery

  • Error recovery must be layered — don't use one heavy hammer for every problem. Judging whether an agent system is mature should not be based on how human it sounds when things go smoothly, but on how systematic it looks when things break. For example, when a prompt is too long, first try draining known backlogs, then attempt heavier full-text compression — don't immediately rebuild the world. A good recovery system first tries to preserve the finest-grained context, then accepts coarser summary replacements only when necessary. Some errors should be handed to the recovery system for an attempt before deciding whether to surface them to the user — what users truly care about is usually whether the system can keep working.
  • Recovery logic must prevent self-referential loops. If compaction doesn't help, continuing to compact will most likely just replay the same failure in a different guise. The most dangerous errors in a system are when failure branches and recovery branches bite each other and begin infinite self-replication. Any automatic recovery mechanism must be countable, limited, and circuit-breakable — a recovery system that can't stop is like a car without brakes: both are technically called systems, but neither should be on the road. Even "repair actions" themselves need repair strategies, because in practice, a compaction request can itself fail due to context being too long. At that point, the priority is to restore the system's breathing room first and worry about information fidelity later.
  • The best recovery after truncation is continuation, not summarization. When output is truncated, the system should continue directly from the breakpoint rather than first apologizing, recapping, or writing elegant filler. First try raising the token cap and re-running directly; if that's still insufficient, append an instruction telling the model to continue from where it was cut off, explicitly requesting no apologies and no review. Every post-truncation recap further consumes budget and increases semantic drift, until the system is no longer doing the task itself but round after round of reviewing itself doing the task. An engineering system's true courtesy lies in not trapping users in a failed state.
  • Interruption is also a failure state that requires semantic closure. A user interruption isn't just "I don't want to watch anymore" — it's a state transition that requires proper wrap-up. Tool calls already dispatched but not yet completed must generate compensating results, ensuring previously committed actions don't become dangling debts. What error recovery truly repairs is not just the error itself, but the system's ability to explain its own behavior — whether the system can articulate "what I just attempted, why it didn't work, and whether to continue or stop now." Once explanatory ability breaks, the system degrades from an engineering object into a mystical one.
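The layered, circuit-broken recovery described above reduces to a small control structure. Strategy names and the attempt limit here are illustrative:

```python
# Sketch: layered error recovery. Try the cheapest, least destructive
# strategy first; bound total attempts; circuit-break rather than loop.
MAX_RECOVERY_ATTEMPTS = 3  # illustrative limit

def recover(strategies, state):
    """strategies: ordered (name, fn) pairs, cheapest first.
    Each fn(state) returns True if it resolved the error."""
    attempts = 0
    for name, strategy in strategies:
        if attempts >= MAX_RECOVERY_ATTEMPTS:
            break                          # circuit breaker: stop recovering
        attempts += 1
        if strategy(state):
            return f"recovered-by-{name}"
    return "surfaced-to-user"              # recovery exhausted; tell the user
```

A prompt-too-long error, for example, might pass through `drain-backlog`, then `compact-history`, then `rebuild-context`, in that order; the counter guarantees the recovery path itself cannot become the infinite loop it exists to prevent.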

7. Multi-Agent and Verification

  • Forking is first and foremost a runtime economics problem. Sub-agents must share cache-critical parameters with the parent (system prompt, context, tool configuration, etc.); otherwise, each fork re-burns the entire token budget, appearing to parallelize for efficiency while actually just parallelizing waste. State isolation is the default discipline — all mutable state is isolated first, and sharing must be explicitly declared, preventing a sub-agent's local chaos from contaminating the main thread. The most valuable aspect of a sub-agent is precisely that it can avoid polluting the main thread with its local messiness: misguided research, temporarily read file states, and one-off reasoning branches. If all of these write directly back to the main context, you only get faster contamination.
  • Research can be delegated; synthesis cannot. What's truly scarce in multi-agent systems is synthesis — compressing the local knowledge each worker brings back into clear, executable, verifiable next steps. The coordinator must digest research results before writing specific instructions; subsequent prompts must reference specific files, specific locations, and specific changes, rather than abstractly stating "based on the previous findings." Without this layer, multi-agent systems quickly degenerate into politely-worded task forwarding machines — every agent is busy, but the system as a whole hasn't gotten any smarter. This is classic engineering division of labor: research can be distributed, but understanding must be re-converged.
  • Verification must be an independent stage. Between "I modified the code" and "the code is therefore correct" lies a very wide river, and models are especially good at building paper bridges across it. Implementers naturally tend to believe their changes are "probably fine," and models even more so — they'll give you changes, explanations, and even plausible-looking test output, but none of that equals the feature actually standing on its own. Verification must become an independent role: those who implement focus on making changes; those who verify focus on questioning whether those changes deserve to survive. The goal of verification is to prove code works, not merely to confirm code exists — otherwise "done" quickly degrades into "I wrote it and I think it's fine."
  • Sub-agents need complete lifecycle management. They should be observable at startup, open to intervention before shutdown, with traceable transcript paths, and when the parent task aborts, child tasks must follow. Whether output files should be retained, whether cleanup callbacks have leaked, how to handle state residue after an agent ends — all of these require explicit handling. A multi-agent demo that merely achieves "I can spawn another agent" falls far short; agents must be treated as runtime entities that can leak resources, leave residual state, and become orphans when parent processes end. The real value of multi-agent isn't parallelism for speed — it's compartmentalizing different kinds of uncertainty into different containers and having the coordinator reassemble them.
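The fork economics and isolation rules can be made concrete in a few lines. `AgentContext` and `fork_subagent` are hypothetical names for illustration:

```python
# Sketch: forking a sub-agent. Cache-critical immutable parameters are
# shared by reference (so prompt caching still hits); mutable state is
# copied or reset so the child's local mess cannot leak back.
import copy
from dataclasses import dataclass, field

@dataclass
class AgentContext:
    system_prompt: str                              # cache-critical: share
    tool_config: tuple                              # immutable: share
    messages: list = field(default_factory=list)    # mutable: isolate
    scratch: dict = field(default_factory=dict)     # mutable: isolate

def fork_subagent(parent: AgentContext) -> AgentContext:
    return AgentContext(
        system_prompt=parent.system_prompt,         # same object: cache hit
        tool_config=parent.tool_config,             # shared immutable config
        messages=copy.deepcopy(parent.messages),    # isolated working history
        scratch={},                                 # sub-agent starts clean
    )
```

The asymmetry is the point: sharing is declared explicitly for immutable, cache-critical values, while everything mutable defaults to isolation, so a sub-agent's dead-end research never writes itself back into the parent's context.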

8. Team Adoption

  • Start by drawing the minimum controllable boundary. Teams don't need to start with hooks and complex skill directories. First, get four things clear: which tasks agents are allowed to participate in, which changes must go through human review, what verification must run after changes, and which resources are strictly off-limits. These four things matter more than any grand slogan. If the acceptable scope isn't defined, people will use agents for things that shouldn't be automated; if review responsibility isn't defined, nobody knows who's the last line of defense when things go wrong; if no-go zones aren't defined, efficiency gains merely widen the blast radius. Many teams ultimately fail not because the agent isn't powerful enough, but because they skipped this step at the start.
  • Unify verification definitions before expanding skill count. The most common failure in adopting an AI coding agent isn't in the prompt or the model — it's that the team has no unified definition of "done." Some think "it runs" is enough; some think "half the tests passed" is fine; some think "the model's explanation sounded convincing" counts. Under these conditions, even the smartest system will only learn to meet the lowest bar. Skills can replicate workflows, but only verification definitions can replicate quality. First define which tasks require independent verification, what verification must minimally include, and how to mark verification failures — once these three things are unified, even with few skills, the quality floor holds.
  • CLAUDE.md is more like a foundation than a bulletin board. Team-level instruction files are suited for stable rules: codebase hard constraints, unified verification standards, collaboration discipline, and output style. They're unsuited for frequently changing temporary processes, operational details used by only a handful of tasks, or steps that belong in scripts or skills. Once the file is written like an encyclopedia, it loses stability and credibility — team members can no longer tell whether an entry describes current rules or a discussion left over from six months ago. The system will also learn a terrible pattern: treating expired conventions as current law.
  • Layer approvals by risk; introduce hooks last. Permission approvals should be layered by irreversibility and environmental sensitivity (read operations < write operations < pushing code / accessing sensitive environments), rather than applying blunt per-tool toggles, because what teams actually need to control is consequences, not button names. Hooks are an advanced automation interface, best introduced after foundational governance has stabilized — otherwise they easily introduce new complexity: scripts nobody maintains, trigger timing nobody can explain, and debugging costs higher than manual operations. The more mature judgment is to first stabilize the floor with review, CI, and minimal documentation, then consider more complex orchestration.
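Layering approval by consequence rather than by tool name can be sketched as a small classifier. The tier names, policy mapping, and action fields are all illustrative assumptions:

```python
# Sketch: approval layered by irreversibility and environmental
# sensitivity, not by which tool button was pressed.
APPROVAL_POLICY = {
    "read": "auto-allow",                 # reversible, no side effects
    "write": "allow-with-log",            # local mutation, recoverable
    "push-or-sensitive": "human-approval" # shared or sensitive state
}

def classify_action(action):
    """Map an action dict to a risk tier by its consequences."""
    if action.get("mutates") and (action.get("remote") or action.get("sensitive_env")):
        return "push-or-sensitive"
    if action.get("mutates"):
        return "write"
    return "read"

def required_approval(action):
    return APPROVAL_POLICY[classify_action(action)]
```

Under this shape, a local `git commit` and a `git push` land in different tiers even though both arrive through the same Bash tool, which is exactly the "control consequences, not button names" point above.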

9. Ten Principles

The book closes by condensing everything into ten principles:

  1. Treat the model as an unstable component, not a colleague. A model may talk like a colleague, but it won't automatically acquire stability, accountability, or sustained judgment. The sooner you accept this, the sooner the system starts adding permissions, recovery, verification, and rollback.
  2. The prompt is part of the control plane. Together with the runtime, tool schemas, memory, and hooks, it forms the control plane. Treat the prompt as a persona setting, and you'll end up with a system that performs well but isn't constrained.
  3. The query loop is the agent system's heartbeat. Input governance, streaming consumption, tool scheduling, recovery branches, and stop conditions are all part of this heartbeat. A system without an execution loop may produce demos, but it doesn't qualify as a runtime.
  4. Tools are managed execution interfaces. Once a model starts touching shell and filesystem, the question shifts from "can it speak" to "will it leave consequences." The more dangerous the tool, the less it should be treated as an ordinary capability.
  5. Context is working memory. Being able to stuff something into context doesn't mean you should. Compaction's goal is preserving the semantic foundation for continued work — the standard isn't "enough" but "governable."
  6. The error path is the main path. Prompt-too-long, output truncation, interruptions, hook loops, and compaction failures are all routine weather for long-session agents. Recovery and circuit breaking must exist by design, not be retrofitted after incidents.
  7. Recovery's goal is to keep working. The best action after truncation is usually continuation; when compaction fails, the top priority is restoring the system's ability to breathe.
  8. Multi-agent's purpose is partitioning uncertainty. Research, implementation, verification, and synthesis go into different containers, with the coordinator converging understanding. The real value of parallelism isn't speed — it's clearer responsibility boundaries.
  9. Verification must be independent — systems cannot grade their own work. For any important task, verification should be an independent stage, ideally with an independent role.
  10. Team institutions matter more than individual skill. An expert can tame an agent through experience alone; a team cannot. Only by institutionalizing individual experience can an agent system become an organizational capability rather than a personal party trick.

For comments and further discussion, mail to [email protected]