
Agent Engineering in 2026: The Harness Is the Product

Across the strongest 2026 operator reports and official docs, the same artifacts keep appearing: progress files, decision logs, checkpointers, approval gates, and runtimes that know when to stop.

The durable stack is the point: files, checkpoints, gates, and verification surfaces that survive the next session.

Across the strongest 2026 operator writeups and official engineering posts, the same artifacts keep appearing. init.sh. claude-progress.txt. feature_list.json. CLAUDE.md. CHANGELOG.md. NOW.md. MEMORY.md. decisions.md. Git history. Checkpointers. Approval gates.

That repetition is the concrete fact that matters.

It tells you where the field actually moved. The model improved, but the operational center of gravity shifted into files, harnesses, checkpoints, approval boundaries, and runtimes that can stop without losing the work. Anthropic describes an initializer agent that creates init.sh and claude-progress.txt, then later sessions that read those artifacts alongside git history and feature_list.json before continuing. Martin Sukany uses NOW.md, MEMORY.md, and decisions.md to keep a long-running agent from forgetting current work, durable facts, and prior vetoes. Anthropic’s scientific computing workflow uses CLAUDE.md and CHANGELOG.md as the durable plan and lab notebook. LangGraph centers checkpointers. Conductor centers durable checkpoints and human approval gates. OpenAI’s own guide defines an agent partly by its ability to halt execution and hand control back to the user.

This is the important correction to the last two years of agent discourse. Production reliability did not come from discovering a magical prompt. It came from moving critical reasoning into external artifacts and operational surfaces the runtime can inspect, persist, and resume. The best 2026 systems are not just smarter chats. They are controlled environments for work.

The field moved outward. The model got better. The durable system around the model became the real product.

Pattern

The recurring artifacts do not show up by accident. They solve the same four problems in slightly different forms.

First, they recreate the environment. init.sh, sandbox setup, and runtime bootstraps answer the practical question every fresh session faces: what has to be running, where it lives, and how to get back to a known-good working state without improvising. Restart cost is a real performance metric for long-running agents. Cheap re-entry is leverage.
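
To make restart cost concrete, here is what a scripted entry path can look like. This is a sketch in Python rather than Anthropic’s actual harness; the function is invented and simply assumes the artifact conventions named in this post.

```python
import subprocess
from pathlib import Path

def reenter(repo: Path) -> str:
    """Hypothetical re-entry ritual: rebuild the environment first,
    then load the handoff artifacts before doing any new work."""
    # 1. Recreate a known-good environment instead of improvising one.
    subprocess.run(["bash", "init.sh"], cwd=repo, check=True)

    # 2. Read the progress file the previous session left behind.
    progress = (repo / "claude-progress.txt").read_text()

    # 3. Consult recent git history for what actually changed.
    log = subprocess.run(
        ["git", "log", "--oneline", "-20"],
        cwd=repo, capture_output=True, text=True, check=True,
    ).stdout

    # Only now is the session cheap to restart and ready to continue.
    return f"{progress}\n\nRecent commits:\n{log}"
```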

Second, they externalize current state. claude-progress.txt, CHANGELOG.md, task lists, and feature_list.json compress the answer to three questions: what exists now, what changed recently, and what is next. A fresh session should not have to rediscover this by wandering the repo. Good systems make current state legible on arrival.
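
What legible-on-arrival means in practice is a small, schema-stable file written at the end of every session. A sketch with an invented schema, loosely in the spirit of feature_list.json:

```python
import json
from datetime import datetime, timezone
from pathlib import Path

STATE_FILE = Path("session_state.json")  # invented name; any stable path works

def write_handoff(done: list[str], next_up: list[str], dead_ends: list[str]) -> None:
    """Persist the three questions a fresh session asks:
    what exists now, what changed, and what is next."""
    STATE_FILE.write_text(json.dumps({
        "updated_at": datetime.now(timezone.utc).isoformat(),
        "completed": done,        # what exists now
        "next_tasks": next_up,    # what is next
        "dead_ends": dead_ends,   # what not to retry
    }, indent=2))

def read_handoff() -> dict:
    """A fresh session starts here instead of wandering the repo."""
    return json.loads(STATE_FILE.read_text())
```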

Third, they separate short-term memory from durable memory. NOW.md is a working set. MEMORY.md is curated long-term context. decisions.md is the veto register. Git history is the detailed audit trail. This separation matters because production agents do not mainly fail from a lack of raw context length. They fail from mixing volatile state, durable facts, negative knowledge, and cancelled decisions into one undifferentiated pile.
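
That separation can be enforced mechanically instead of by discipline alone. A sketch of a loader that keeps the layers distinct and rejects an oversized working set; the byte cap is an arbitrary illustration:

```python
from pathlib import Path

NOW_LIMIT_BYTES = 8_192  # arbitrary cap, for illustration only

def load_memory(root: Path) -> dict[str, str]:
    """Load the three memory layers without merging them into one pile."""
    now = (root / "NOW.md").read_text()
    if len(now.encode()) > NOW_LIMIT_BYTES:
        # An oversized working set is a bug, not a convenience.
        raise ValueError("NOW.md exceeds the working-set limit; prune before continuing")
    return {
        "working_set": now,                                 # volatile, current work
        "durable_facts": (root / "MEMORY.md").read_text(),  # curated long-term context
        "vetoes": (root / "decisions.md").read_text(),      # cancelled and prohibited actions
    }
```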

Fourth, they govern action. Checkpointers, traces, run state, approval gates, and budget caps exist because side effects are where agents stop being demos. A model can be wrong inside a draft and you lose a minute. A model can be wrong across a shell, a patch tool, or a deployment surface and you lose a week. Durable execution is not a nice-to-have. It is the minimum required for interruption, review, and safe resumption.
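
In code, governing action mostly reduces to one wrapper around every side effect. A minimal sketch with invented names; real runtimes put this logic in the executor, not the prompt:

```python
from typing import Callable

class BudgetExceeded(Exception):
    pass

class ActionGate:
    """Hypothetical gate: every side effect pays into a budget, and risky
    surfaces require an explicit human approval callback."""

    RISKY = {"shell", "deploy", "external_api", "payments"}

    def __init__(self, budget: int, approve: Callable[[str], bool]):
        self.remaining = budget
        self.approve = approve

    def run(self, surface: str, action: Callable[[], str]) -> str:
        if self.remaining <= 0:
            # Stop cleanly instead of improvising past the cap.
            raise BudgetExceeded("action budget exhausted")
        if surface in self.RISKY and not self.approve(surface):
            return "blocked: awaiting human approval"
        self.remaining -= 1
        return action()

# Usage: gate = ActionGate(budget=50, approve=lambda s: s != "payments")
```

The gate is deliberately boring. The interesting property is that nothing reaches a tool without passing through it.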

The unit of design is the handoff, not the prompt.

A useful way to read the 2026 stack is to ignore the branding and inspect the artifacts. When the artifacts line up, the systems line up. Anthropic, OpenAI, LangChain, Conductor, and serious operators all converged on the same answer because the same constraints keep showing up in production.

| Layer | Artifact or control surface | Operational job | Representative source |
| --- | --- | --- | --- |
| Bootstrap | init.sh, runtime startup, sandbox config | Rebuild a known environment on fresh entry | Anthropic Engineering |
| Session handoff | claude-progress.txt, feature_list.json, CHANGELOG.md | Carry forward progress, dead ends, and next tasks | Anthropic Engineering, Anthropic Science |
| Durable instructions | CLAUDE.md, AGENTS.md, rules files | Keep project policy and mission in context | Anthropic Science, OpenAI |
| Working memory | NOW.md, MEMORY.md, decisions.md | Separate active work, durable facts, and vetoes | Martin Sukany, Zak El Fassi |
| Recovery and replay | checkpointers, run state, tracing | Pause, resume, inspect, and avoid repeating side effects | LangChain Docs, OpenAI Agents SDK |
| Governance | approval gates, budgets, stop conditions | Prevent silent escalation across real-world boundaries | Conductor OSS, OpenAI |

The mature pattern also splits state into two complementary forms.

Human-legible artifacts such as CLAUDE.md, NOW.md, MEMORY.md, and CHANGELOG.md preserve rationale, mission, and negative knowledge in a form an operator can review. Machine-legible artifacts such as feature_list.json, checkpoint tables, trace events, and run-state snapshots give runtimes something they can resume, diff, gate, and analyze automatically. Systems built only from prose become hard to automate. Systems built only from machine state become hard to audit. The durable 2026 stack needs both.

2026 agent engineering is restart engineering.

Strip the branding away and the operating loop is stable.

```mermaid
flowchart TD
    A[Operator intent and policy]
    B[Approval gates and stop rules]
    C[Harness prompts and evaluators]
    D[Runtime and sandbox sessions]
    E[Files and git state]
    F[Checkpoints, run state, traces]
    G[Tools, MCP servers, APIs]
    H[Model]
    I[Tests and verification]
    J[Next-session receipts]

    A --> B --> C --> D --> E --> F --> G --> H --> I --> J
    J --> E
    I --> B

    subgraph DurableFiles[Durable project state]
      E1[CLAUDE.md or AGENTS.md]
      E2[Progress log or CHANGELOG.md]
      E3[NOW.md, MEMORY.md, decisions.md]
      E4[Git history and diffs]
    end

    E --> E1
    E --> E2
    E --> E3
    E --> E4

    subgraph RuntimeState[Runtime state]
      F1[Sessions and RunState]
      F2[Checkpoint store]
      F3[Tracing]
      F4[Require approval]
    end

    F --> F1
    F --> F2
    F --> F3
    F --> F4
```

The loop is the point: policy shapes the harness, the harness drives the runtime, the runtime writes receipts, and the next session starts from those receipts.

Evidence

The official sources are now explicit.

Anthropic’s engineering post, “Effective harnesses for long-running agents”, describes a two-part harness. The first session uses an initializer agent. Its job is to create an init.sh script, a claude-progress.txt file, and an initial git commit. Every later session reads the progress file, checks feature_list.json, consults git history, restarts services with init.sh, and only then continues the work. That is not a prompt trick. It is an operational scaffold.

The important point is not the specific filenames. The point is the architecture those filenames represent. Anthropic is solving the core long-running problem the same way effective teams solve human shift handoffs: create a reliable entry path, keep a short work log, and preserve recent history where the next worker can inspect it without reconstructing the whole system from scratch.

Anthropic states the core problem directly. Long-running agents operate across discrete sessions, and each new session begins with no memory of what came before. Their solution is equally direct: structured artifacts that bridge context windows. That is the center-of-gravity shift in one sentence. Not more latent brilliance. Better external state.

The follow-up post, “Harness design for long-running application development”, pushes the same argument further. It describes a planner, generator, and evaluator architecture, but the important part is not the number of agents. The important part is what those agents depend on. Anthropic explicitly says performance gains came from decomposing the build into tractable chunks and using structured artifacts to hand off context between sessions. Even in a three-agent design, the real leverage is the artifact layer, not the fact that there are three names in the loop.

That distinction matters because many teams copied the visible surface of multi-agent systems and missed the deeper move. They saw planner, generator, evaluator. They did not see durable files, restart discipline, controlled handoffs, and explicit evaluation criteria. When those teams failed, they concluded they needed better specialists. Usually they needed stronger state.

Anthropic’s research post, “Long-running Claude for scientific computing”, confirms that this is not limited to app scaffolding or frontend demos. The same pattern holds in a scientific workflow with tight coupling and numerical correctness requirements. The root CLAUDE.md carries deliverables and relevant context. The progress file, called CHANGELOG.md by convention in that piece, functions as portable long-term memory. Anthropic is explicit about what belongs there: current status, completed tasks, failed approaches and why they failed, accuracy tables at key checkpoints, and known limitations.

That last item is operational gold. Failed approaches need to be written down because otherwise later sessions reattempt the same dead ends with fresh confidence. Durable memory is not just a place to store wins. It is also a place to store prohibitions, failed paths, and reasons not to retry something. Agents do not merely forget facts. They forget vetoes and negative results. The fix is not more context stuffing. The fix is a memory surface designed to preserve refusal-worthy history.

Martin Sukany’s “Ten Days with an AI Agent” makes the same point from the operator side. He describes a three-layer memory design borrowed from cache hierarchy. NOW.md is the strict working set. MEMORY.md is the curated long-term facts file. decisions.md is the anti-Dory register that records cancelled or paused actions with date, scope, and reason. He uses that register to stop cron jobs and external integrations from re-deriving actions that had already been vetoed.

This is exactly the failure mode many teams misclassify as dangerous autonomy. In practice it is often just decision loss. The runtime restarted. The conversational context vanished. The veto was never promoted into durable state. The system then reasoned its way back into a previously rejected action because nothing in the operational scaffold prevented it.

Sukany’s essay is also valuable because it argues for aggressive size discipline. A hard limit on NOW.md keeps the working set honest. A curated MEMORY.md prevents long-term context from degenerating into transcript sludge. This is a better memory lesson than most vector-store discussions. The problem is not just retrieval quality. The problem is keeping the state surface small enough that the next session can trust it.

Zak El Fassi’s “How Do You Want to Remember?” extends the same discipline. One of the concrete restructures in that piece is adding rationale fields to decisions.md. That small change is more important than it looks. A veto without a reason is brittle. A veto with a reason survives new sessions, new agents, and new humans. It can be retrieved, defended, and compared against new evidence. Durable memory is not just recall. It is preserved justification.
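
A veto register with rationale is a small schema, not a big system. A sketch of one entry shape with the date, scope, and reason fields described above; the field names and the JSON-lines serialization are illustrative choices, and a markdown table would work just as well:

```python
from dataclasses import dataclass, asdict
from datetime import date
import json

@dataclass
class Decision:
    """One entry in a decisions.md-style veto register."""
    decided_on: str   # when the veto was made
    scope: str        # what it covers, e.g. "cron: nightly-sync"
    status: str       # "cancelled", "paused", or "prohibited"
    reason: str       # the rationale that lets the veto survive new sessions

def append_decision(d: Decision, path: str = "decisions.md") -> None:
    # One JSON line per decision: human-readable and machine-checkable.
    with open(path, "a") as f:
        f.write(json.dumps(asdict(d)) + "\n")

append_decision(Decision(
    decided_on=str(date.today()),
    scope="cron: nightly-sync",
    status="cancelled",
    reason="duplicated the upstream sync and double-wrote records",
))
```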

Zak also pushes on semantic density. Weekly summaries, searchable people files, backfilled rationales, and chunk designs that improve retrieval all reflect the same underlying fact: memory quality depends less on total volume than on the shape of the artifacts being indexed. This is the same lesson that shows up in every competent long-running setup. Better documents beat more documents.

Daniel Georgiev’s “What I Learned Running 16 AI Agents on a Single VPS” supplies the cleanest counterexample to agent-count maximalism. After ten weeks of running sixteen agents, he replaced them with two. The lesson was not that agents are useless. The lesson was that orchestration complexity compounds faster than throughput when state coordination is weak. A coordinator and a swarm of specialists sounds efficient until you realize every handoff is another opportunity for context loss, duplicate work, and disagreement about what reality currently is.

Georgiev’s report matters because it refuses the default vanity metric. He does not count agents. He counts what survived production contact. That is the right metric for the whole field.

OpenAI’s guide, “A practical guide to building agents”, aligns with the same operational emphasis. The guide defines agents not just as systems that use an LLM to manage workflow execution, but as systems that can recognize when a workflow is complete, proactively correct their actions, halt execution on failure, and transfer control back to the user. That wording matters. The category is not defined by infinite autonomy. It is defined partly by competent stopping behavior.

That is the correct production definition. The useful agent is not the one that keeps going at all costs. The useful agent is the one that knows when continuation would be dishonest, destructive, or expensive.

The OpenAI Agents SDK docs make the implementation layer visible. The SDK surfaces sessions, tracing, human-in-the-loop, and RunState. The human-in-the-loop docs are especially revealing. Tools and MCP servers can mark calls with require_approval. Pending approvals appear as interruptions. RunState exists so runs can be serialized, paused, approved or rejected, and resumed. The result of a run is no longer just text. It is increasingly a control object.
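
The control-object shape is easy to sketch without committing to any SDK’s exact names. Everything below is a hypothetical stand-in, not the OpenAI Agents SDK API; it only illustrates the idea of a run you can serialize, approve, and resume:

```python
from dataclasses import dataclass, field
import json

@dataclass
class PendingApproval:
    tool: str
    arguments: dict

@dataclass
class RunSnapshot:
    """Hypothetical stand-in for a RunState-style control object."""
    transcript: list[dict] = field(default_factory=list)
    pending: list[PendingApproval] = field(default_factory=list)
    approved: list[str] = field(default_factory=list)

    def serialize(self) -> str:
        # The run can sit on disk while a human decides.
        return json.dumps({
            "transcript": self.transcript,
            "pending": [vars(p) for p in self.pending],
            "approved": self.approved,
        })

    def approve(self, tool: str) -> None:
        # Approval mutates run state; a real runtime resumes from here
        # instead of re-running the transcript.
        self.pending = [p for p in self.pending if p.tool != tool]
        self.approved.append(tool)
```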

That is the shape of maturity. Real agent systems now return resumable state, approval metadata, traces, and transcripts fit for replay. Once again the operational center moved outward from the model and into the harness.

LangChain says the quiet part out loud in “Building LangGraph: Designing an Agent Runtime from first principles”. LangGraph was built around control and durability, with as little abstraction as possible. That design choice tells you what production users actually demanded. Not a magic prompt layer. Not another agent persona format. Control and durability.

The associated LangGraph docs on “Durable execution” are even more concrete. Durable execution is enabled by specifying a checkpointer. Workflows resume from the last recorded state. Non-deterministic operations and side effects need to be wrapped so they are not repeated on resume. Separate side effects should be isolated so replay can recover results from persistence instead of doing the work again. This is the kind of detail that differentiates a real runtime from a toy loop. Production systems are defined by what they do when interrupted halfway through a side effect.
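
Concretely, durability in LangGraph is one argument at compile time. A minimal sketch against the LangGraph API; the graph is a toy, and in production the in-memory saver would be SqliteSaver or PostgresSaver:

```python
from typing import TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.memory import MemorySaver

class State(TypedDict):
    drafts: list[str]

def write_draft(state: State) -> State:
    return {"drafts": state["drafts"] + ["next section"]}

builder = StateGraph(State)
builder.add_node("write_draft", write_draft)
builder.add_edge(START, "write_draft")
builder.add_edge("write_draft", END)

# The checkpointer is what makes the run durable: every step is persisted,
# and the same thread_id resumes from the last recorded state instead of
# starting over. Swap MemorySaver for SqliteSaver or PostgresSaver to
# survive process restarts.
graph = builder.compile(checkpointer=MemorySaver())
config = {"configurable": {"thread_id": "session-1"}}
print(graph.invoke({"drafts": []}, config))
```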

The same pattern shows up in the runtime code surfaces people now choose. LangGraph has Pregel and checkpoint abstractions such as BaseCheckpointSaver, SqliteSaver, and PostgresSaver. OpenAI’s stack exposes sessions, RunState, tracing, and require_approval. OpenHands centers environment boundaries with Runtime, DockerRuntime, ActionExecutionServer, and SandboxConfig. Dapr Agents names the idea directly with DurableAgent. Different codebases, same nouns. The important abstractions in 2026 are not clever thoughts. They are persistence, replay, sandboxing, approval, and state transfer.

Conductor OSS is unusually blunt in “Production Agent Architecture”. It advertises the pattern with no mysticism: every step is a durable checkpoint, human approval is a durable gate, retry is automatic and configurable, memory persists across iterations, budget caps prevent runaway agents, compensation handles side effects, and observability is automatic. That list reads like a summary of the entire field because it is. Once agents touch real systems, the runtime starts to look a lot more like workflow engineering than chat orchestration.

The evidence is now broad enough that it is no longer useful to treat these as isolated best practices. This is a converged operating model.

The same answers keep showing up because the same constraints keep showing up. Context windows still end. Sessions still restart. Sandboxes still die. Side effects still matter. Prior refusals still need to remain refusals. Real systems still need audit trails. The durable agent stack emerged because these problems are universal, not because one lab got temporarily fashionable.

A narrow but practical taxonomy falls out of the evidence.

  1. CLAUDE.md, AGENTS.md, and rules files hold policy, scope, and project identity.
  2. NOW.md, progress logs, and task files hold the live working set.
  3. MEMORY.md, CHANGELOG.md, and curated notes hold durable facts and lessons.
  4. decisions.md holds prohibitions, cancellations, and the reasons they exist.
  5. Git history holds the cheapest high-resolution audit trail.
  6. Checkpointers and run state hold machine-readable recovery state.
  7. Approval gates and budget caps hold the boundary between safe autonomy and real-world consequence.

Every serious 2026 agent system needs answers for all seven. Teams that claim to have an advanced agent platform but cannot point to these surfaces are usually describing a demo harness with better branding.
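
Those seven surfaces are also checkable. A toy audit sketch; the path conventions are invented, and the point is only that each answer should be a concrete artifact rather than a claim:

```python
from pathlib import Path

# Invented mapping from the seven surfaces to artifacts a repo might use.
SURFACES = {
    "policy": ["CLAUDE.md", "AGENTS.md"],
    "working_set": ["NOW.md", "claude-progress.txt"],
    "durable_facts": ["MEMORY.md", "CHANGELOG.md"],
    "vetoes": ["decisions.md"],
    "audit_trail": [".git"],
    "recovery_state": ["checkpoints.db"],
    "governance": ["approval_policy.yaml"],
}

def audit(root: Path) -> dict[str, bool]:
    """True if at least one artifact exists for each surface."""
    return {
        surface: any((root / p).exists() for p in paths)
        for surface, paths in SURFACES.items()
    }

missing = [s for s, ok in audit(Path(".")).items() if not ok]
print("unanswered surfaces:", missing or "none")
```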

Implication

The practical implication is that agent engineering is now an operations discipline. Not operations in the narrow sense of deploy scripts. Operations in the broader sense of shaping how work moves, stops, resumes, and gets audited across time.

The model still matters. Better models reduce error rates, compress tool calls, and improve judgment. But model choice now explains less variance than harness quality in many real workloads. Anthropic, OpenAI, LangChain, Conductor, and independent operator reports all point in the same direction. Once the model is competent enough to use tools and follow structure, the determining variable becomes the quality of the surrounding system.

That surrounding system has a few non-negotiable properties.

It must separate instruction from memory. A root CLAUDE.md or equivalent should not become a landfill for every fact. Put stable policy and mission there. Put current work in a small state file. Put durable facts in a curated memory file. Put cancelled decisions in a veto register. This sounds banal. It is one of the highest-leverage design choices available because retrieval quality depends on the shape of the underlying documents.

It must preserve negative knowledge. Teams overinvest in storing summaries and underinvest in storing refusals. Production agents need to remember what not to do, what not to retry, what was blocked by policy, and which approaches already failed. This is why decisions.md, failure notes, and rejected-approach logs matter as much as success summaries.

It must treat git as memory, not just version control. Anthropic’s harnesses explicitly consult git history. That is the right instinct. Git is cheap, already present, and lossless enough to answer questions a summary cannot: what changed, when, in which sequence, against which parent state, and by whom. A well-structured git history is one of the strongest long-term memory systems available because it preserves evidence instead of paraphrase.
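
Reading git as memory requires nothing beyond the porcelain commands. A short sketch of the questions a summary cannot answer:

```python
import subprocess

def git(args: list[str]) -> str:
    return subprocess.run(
        ["git", *args], capture_output=True, text=True, check=True
    ).stdout

# What changed, and in which sequence: recent history as evidence.
recent = git(["log", "--oneline", "--no-merges", "-15"])

# Against which parent state: the exact shape of the last change.
last_change = git(["show", "--stat", "HEAD"])

# By whom: attribution the next session can trust.
authors = git(["shortlog", "-sn", "HEAD"])
```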

It must define stop conditions. “Keep going” is not a stop condition. Production agents need explicit terminal states such as done, blocked, awaiting_approval, awaiting_external_event, needs_human_decision, and aborted_due_to_budget. This is the operational meaning of OpenAI’s emphasis on halting execution and Conductor’s emphasis on durable approval gates. A runtime that cannot stop cleanly is not autonomous. It is just uncontrolled.
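
Terminal states belong in code. A minimal sketch of an explicit stop set, with a loop whose only unconditional exit is itself a stop:

```python
from enum import Enum
from typing import Callable, Optional

class Stop(Enum):
    DONE = "done"
    BLOCKED = "blocked"
    AWAITING_APPROVAL = "awaiting_approval"
    AWAITING_EXTERNAL_EVENT = "awaiting_external_event"
    NEEDS_HUMAN_DECISION = "needs_human_decision"
    ABORTED_DUE_TO_BUDGET = "aborted_due_to_budget"

def run_until_stopped(step: Callable[[], Optional[Stop]], max_steps: int = 100) -> Stop:
    """step() returns None to keep going or a Stop to halt. There is no
    unconditional keep-going branch: exhausting the budget is a stop."""
    for _ in range(max_steps):
        outcome = step()
        if outcome is not None:
            return outcome
    return Stop.ABORTED_DUE_TO_BUDGET
```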

It must make replay safe. LangGraph’s durable execution guidance is blunt about idempotence and side effects because resume semantics are where many systems quietly fail. If a resumed run can duplicate API calls, write the same patch twice, or replay a deployment step without realizing it, the system is not durable even if it has a checkpoint object in the codebase. Real durability is not “we saved some state.” Real durability is “resume does not corrupt reality.”
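
Replay safety mostly comes down to a persistence key per side effect. A sketch of the wrap-your-side-effects rule, using a local JSON file as a stand-in for a real checkpoint store:

```python
import json
from pathlib import Path
from typing import Callable

EFFECTS = Path("effects.json")  # stand-in for a real checkpoint store

def once(effect_id: str, side_effect: Callable[[], dict]) -> dict:
    """Run a side effect at most once per effect_id. On replay, recover the
    recorded result from persistence instead of repeating the call."""
    done = json.loads(EFFECTS.read_text()) if EFFECTS.exists() else {}
    if effect_id in done:
        return done[effect_id]      # resumed run: recover, do not redo
    result = side_effect()          # first run: actually do the work
    done[effect_id] = result
    EFFECTS.write_text(json.dumps(done))
    return result

# The same deploy step, replayed after a crash, deploys exactly once.
once("deploy:v42", lambda: {"status": "deployed", "version": 42})
```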

It must expose approval as a throughput tool, not as bureaucracy. Good approval gates do not slow teams down. They move human attention to the smallest set of high-leverage boundaries: network access, shell execution, destructive patches, external communication, payments, deploys, writes to production data. Everything else should continue without drama. OpenAI’s require_approval surface is useful for exactly this reason. It lets the runtime pause at the expensive edge instead of at every edge.

It must make traces inspectable. If an agent touched a shell, a patch tool, or an external service, operators need a legible trail. Run state, traces, diff summaries, memory writes, and approval decisions are all part of the same audit surface. The real operator standard is reconstructability: an operator should be able to explain why the system took an action, from which state, under which rule, and with which evidence.

Once you accept these implications, several design choices become clearer.

The right default memory system is not huge. It is layered. A tight current-state file beats a giant journal for next-session continuity. A curated long-term file beats a raw transcript dump for durable facts. A decision log with rationale beats a vague summary when policy collides with new work.

The right default orchestration is not maximal. It is minimal until the boundaries are real. Add a second agent when it owns a different evaluation function, a distinct tool surface, or an isolated execution environment. Add a third when time horizon or permission boundary justifies it. Otherwise keep the work in one agent and invest in better artifacts.

The right metric is not agent count or tool count. It is continuity quality. Can a fresh session recover state in minutes, not hours. Can it avoid repeating a failed path. Can it halt at the right boundary. Can it resume without duplicating side effects. Can a human inspect the trail and trust it.

The mature stack optimizes for a small set of operational outcomes.

  • Re-entry latency. A fresh session should become useful quickly.
  • Veto durability. A cancelled decision should stay cancelled.
  • Side-effect containment. The runtime should know which actions need review.
  • Replay safety. Resumption should not duplicate consequences.
  • Auditability. An operator should be able to explain what happened.

That is why the strongest 2026 systems look conservative compared with the most viral 2024 demos. They are less enchanted with autonomy. They are more willing to write files, define schemas, set budgets, force stops, and preserve rationale. This is not regression. It is what a field looks like when it starts caring about failure modes more than screenshots.

Contrarian take

Most teams should delete half their agents.

This is not a stylistic preference. It is an operational claim.

Every additional agent adds at least one handoff surface. Every handoff surface creates a state synchronization problem. Every state synchronization problem forces you to answer the same questions again: what is current, what is durable, what was vetoed, what has already been tried, what can be retried safely, and what needs approval before continuation. If those answers are weak, adding another specialist increases confusion faster than it increases throughput.

Daniel Georgiev’s reduction from sixteen agents to two is the clean case study, but the same lesson is visible inside the official sources. Anthropic’s planner, generator, and evaluator architecture works because the roles are genuinely different. The evaluator owns an evaluation function. The planner owns decomposition. The generator owns execution. That is not roleplay. It is division by operational boundary. Most internal agent teams do not have boundaries that clean. They have multiple agents because multiple agents look advanced.

The production bottleneck is usually state discipline, not missing specialists.

Teams often add agents to compensate for weak artifacts. No durable memory file, add a memory agent. No explicit evaluation criteria, add a reviewer agent. No approval surface, add a human proxy agent. No sandbox isolation, add a runner agent. This can work, but it often hides the real problem: the system lacks clean files, clear contracts, and a runtime that can stop and resume safely. Specialist agents are expensive substitutes for good state design.

Keep the specialists that own one of these boundaries.

  • A distinct permission surface, such as production deploy or external communication.
  • A distinct evaluation function, such as formal verification, design review against explicit criteria, or adversarial testing.
  • A distinct runtime, such as an isolated container, browser environment, or long-wait background worker.
  • A distinct time horizon, such as nightly memory maintenance or durable event waiting.

Delete the rest, or collapse them into clearer stages inside one harness.

There is a reason the most reusable 2026 abstractions are RunState, checkpointers, approval gates, sandboxes, and memory files. Those abstractions solve real bottlenecks. Persona multiplication usually does not.

Primary sources

  1. Anthropic Engineering, “Effective harnesses for long-running agents”
  2. Anthropic Engineering, “Harness design for long-running application development”
  3. OpenAI, “A practical guide to building agents”
  4. LangChain, “Building LangGraph: Designing an Agent Runtime from first principles”
  5. LangChain Docs, “Durable execution”
  6. Daniel Georgiev, “What I Learned Running 16 AI Agents on a Single VPS”
  7. Martin Sukany, “Ten Days with an AI Agent”
  8. Zak El Fassi, “How Do You Want to Remember?”
  9. Anthropic Science, “Long-running Claude for scientific computing”
  10. Conductor OSS, “Production Agent Architecture”
  11. OpenAI Agents SDK docs, “OpenAI Agents SDK”

The strongest agent systems in 2026 do not win because they talk the longest or spawn the most specialists. They win because they leave a legible trail, preserve the right files, stop at the right boundary, and resume without lying about state.

In production, the best agent is not the one that sounds smartest. It is the one that leaves state you can trust, crosses boundaries with approval, and stops before it improvises.