Everything about AI Harnesses

1. What a harness is

A harness is the software wrapper around a stateless Large Language Model. The model itself predicts text tokens. It cannot open a browser, query a database, or remember what happened three messages ago. The harness is everything that connects the model to reality: state management, tool execution, guardrails, memory, and the loop that keeps the system running until the job is done.

In practice, harnesses fall into two categories.

Two types of harnesses. Evaluation harnesses measure models. Agent harnesses make them operational.

1.1 Evaluation harnesses

An evaluation harness is a standardized framework for scoring, auditing, and benchmarking model outputs programmatically. It replaces subjective evaluation ("vibe checks") with objective, numeric pass/fail thresholds. It feeds thousands of test cases into a model, parses the outputs, and grades them against known benchmarks such as MMLU or HumanEval. In a CI/CD pipeline, it functions as a gatekeeper: deployments halt automatically if a model's safety or accuracy scores fall below a baseline, or if its hallucination rate rises above one. EleutherAI's LM Evaluation Harness is among the most widely used examples, and it served as the backend for Hugging Face's Open LLM Leaderboard.
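The gatekeeper idea can be sketched in a few lines. This is a toy illustration, not the LM Evaluation Harness API: the names `exact_match_accuracy`, `gate`, `toy_model`, and `CASES` are all hypothetical, and exact-match scoring is only one of many possible metrics.

```python
from typing import Callable, Iterable, Tuple


def exact_match_accuracy(
    model: Callable[[str], str], cases: Iterable[Tuple[str, str]]
) -> float:
    """Fraction of cases where the normalized output equals the reference."""
    cases = list(cases)
    if not cases:
        raise ValueError("no test cases supplied")
    hits = sum(
        1
        for prompt, expected in cases
        if model(prompt).strip().lower() == expected.strip().lower()
    )
    return hits / len(cases)


def gate(score: float, baseline: float, tolerance: float = 0.0) -> bool:
    """True if deployment may proceed: score has not dropped past the baseline."""
    return score >= baseline - tolerance


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call."""
    answers = {"2+2=": "4", "Capital of France?": "Paris"}
    return answers.get(prompt, "unknown")


CASES = [
    ("2+2=", "4"),
    ("Capital of France?", "paris"),
    ("Largest planet?", "Jupiter"),
]

score = exact_match_accuracy(toy_model, CASES)  # two of three cases match
deploy_allowed = gate(score, baseline=0.9)      # False: the gate blocks the deploy
```

In a real pipeline the boolean would typically become the process exit code, so the CI system halts the deployment without any human reading the scores.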

1.2 Agent harnesses

An agent harness is the runtime environment that turns a text-prediction model into an AI agent, able to execute multi-step workflows, manage state, and interact with external systems. It has three jobs. First, state management: holding transient memory across multi-turn interactions. Second, tool integration: translating the model's text output into calls over an execution protocol, so the model can reach web search APIs, databases, or terminal consoles. Anthropic's Model Context Protocol has served as a cross-vendor standard here since its December 2025 donation to the Linux Foundation. Third, guardrails and intercepts: rate limits, budget caps, loop counters that kill infinite loops, and human-in-the-loop approval checkpoints for high-risk actions.
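The third job, guardrails, is the easiest to make concrete. The following is a minimal sketch under stated assumptions: the class `Guardrails`, the exception `GuardrailViolation`, the `approver` callback, and the per-call cost figure are all hypothetical names chosen for illustration, not part of any particular framework.

```python
from dataclasses import dataclass, field
from typing import Callable, Set


class GuardrailViolation(Exception):
    """Raised when the harness refuses to run a tool call."""


def deny_all(tool: str, argument: str) -> bool:
    """Default approver: reject every high-risk action."""
    return False


@dataclass
class Guardrails:
    max_steps: int = 20                      # loop counter limit
    budget_usd: float = 1.00                 # budget cap
    high_risk_tools: Set[str] = field(default_factory=set)
    approver: Callable[[str, str], bool] = deny_all  # human-in-the-loop hook
    steps: int = 0
    spent_usd: float = 0.0

    def authorize(self, tool: str, argument: str, cost_usd: float) -> None:
        """Raise GuardrailViolation if a limit is hit; otherwise record the call."""
        if self.steps >= self.max_steps:
            raise GuardrailViolation("loop counter exceeded")
        if self.spent_usd + cost_usd > self.budget_usd:
            raise GuardrailViolation("budget cap exceeded")
        if tool in self.high_risk_tools and not self.approver(tool, argument):
            raise GuardrailViolation(f"human approval denied for {tool}")
        self.steps += 1
        self.spent_usd += cost_usd
```

The harness would call `authorize` before every tool execution. Because the check sits outside the model, no prompt can talk its way past it.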

2. The agent autonomy spectrum

The word "agent" is experiencing massive marketing inflation. Chatbot fatigue and investor demand have pushed software vendors to rebrand basic scripts as "autonomous agents." To build or buy effectively, you have to distinguish between three levels of autonomy. Most of what ships as an agent today sits at the lowest level, the leftmost column of the spectrum.

The autonomy spectrum. Most "agents" shipping today are linear chains on the left. Real agency requires the open loop on the right.

2.1 Linear chains: the fake agents

The marketing says an agent is autonomously executing a multi-step workflow. The reality is a strict, hard-coded sequential pipeline: Prompt 1 summarizes, Prompt 2 translates, Prompt 3 formats. The model fills in the blanks; it does not make independent decisions. If Step 2 returns an unexpected format or an API error, a linear chain cannot pivot. It breaks, throws an exception, and crashes. These look impressive in controlled demos and fall apart on first contact with real data.
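A toy version makes the brittleness visible. In this sketch the three functions are hypothetical stand-ins for the three prompts (no model is called), and the assumption that steps exchange JSON is ours, chosen only to show what happens when one step returns an unexpected format.

```python
import json
from typing import Callable, Sequence


def summarize(text: str) -> str:
    """Stand-in for Prompt 1: returns a JSON payload."""
    return json.dumps({"summary": text[:40]})


def translate(payload: str) -> str:
    """Stand-in for Prompt 2: expects JSON and fails on anything else."""
    data = json.loads(payload)  # raises json.JSONDecodeError on plain text
    return data["summary"].upper()


def format_output(text: str) -> str:
    """Stand-in for Prompt 3."""
    return f"* {text}"


def linear_chain(
    text: str,
    steps: Sequence[Callable[[str], str]] = (summarize, translate, format_output),
) -> str:
    """Run each step in fixed order; nothing inspects or reroutes on failure."""
    value = text
    for step in steps:
        value = step(value)
    return value


def off_script_step(text: str) -> str:
    """Simulates a first step that replies in prose instead of JSON."""
    return "Sorry, I cannot summarize that."
```

`linear_chain("hello")` returns `"* HELLO"`, while `linear_chain("hello", steps=(off_script_step, translate, format_output))` raises `json.JSONDecodeError`. The chain has no branch that could react to the bad output.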

2.2 Controlled state graphs: the practical enterprise agents

This is where many production enterprise agents get built today. Frameworks like LangGraph implement this pattern. Human engineers map out a strict flowchart of allowed paths and conditional branches. The LLM does not invent new steps. It acts as a router, deciding which pre-defined path to take next based on the data it sees. The AI cannot wander outside the pre-drawn boundaries, which gives operational safety at the cost of flexibility.
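The router pattern can be sketched without any framework. This is plain Python, not LangGraph's API: the graph `ALLOWED_EDGES`, the node names, and `toy_router` (a stand-in for the LLM's routing decision) are hypothetical.

```python
from typing import Callable, Dict, List, Set

# Engineers draw the flowchart in advance: node -> nodes it may hand off to.
ALLOWED_EDGES: Dict[str, Set[str]] = {
    "classify": {"refund", "escalate"},
    "refund": {"done"},
    "escalate": {"done"},
}


def run_graph(
    router: Callable[[str, dict], str],
    state: dict,
    start: str = "classify",
    max_hops: int = 10,
) -> List[str]:
    """Follow router choices, rejecting any edge that is not pre-approved."""
    node = start
    path = [node]
    while node != "done":
        if len(path) > max_hops:
            raise RuntimeError("max hops exceeded")
        choice = router(node, state)
        if choice not in ALLOWED_EDGES[node]:
            raise ValueError(f"illegal transition {node} -> {choice}")
        node = choice
        path.append(node)
    return path


def toy_router(node: str, state: dict) -> str:
    """Stand-in for the model: picks among pre-defined paths based on the data."""
    if node == "classify":
        return "refund" if state["amount"] <= 50 else "escalate"
    return "done"
```

With `{"amount": 20}` the path is `classify`, `refund`, `done`; with `{"amount": 500}` it goes through `escalate` instead. A router that tries to jump somewhere unlisted is stopped by the `ValueError`, which is the safety-for-flexibility trade described above.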

2.3 Pure autonomous agents: true agency

The model receives an open-ended goal, a terminal or environment, a toolset, and a loop. It figures out the entire execution path by itself, writing its own sub-steps in real time. Claude Code operates this way: given a repository and a task, it reads files, writes code, runs tests, interprets errors, and iterates until the tests pass. No human mapped the steps in advance.

3. The ReAct loop

The foundational pattern that turns a text generator into an active decision-maker is the ReAct framework (Reason + Act). It enforces a three-phase loop (Thought, Action, Observation) repeated until the goal is met, the budget runs out, or the harness stops it.

The ReAct loop. Three phases (thought, action, observation) repeating until the exit gate fires.

3.1 How the harness transforms a model into an agent

To turn a raw LLM into an agent, the harness injects a strict system prompt that defines the loop:

SYSTEM ROLE INSTRUCTIONS:
You are an autonomous executor. Solve the user's goal
using a strict, iterative loop:

1. THOUGHT: Reason about the current state and identify
   the missing data.
2. ACTION: Call exactly one tool from the allowed list
   using the format: ToolName[parameter].
3. OBSERVATION: Stop generating text immediately and wait
   for the harness to provide execution results.

Exit Gate: When the objective is fully solved, respond
with: FINAL_ANSWER[output].

The model generates a thought and a tool call, then stops. The harness parses the tool call, executes the tool, and feeds the result back into context. The model sees the new information and reasons again. This cycle repeats until the model hits the exit gate or the harness enforces a stop condition (budget, time, loop count).
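The harness side of this cycle might look roughly like the sketch below. It assumes the model prefixes its tool call with `ACTION:` and uses the `ToolName[parameter]` and `FINAL_ANSWER[output]` formats from the system prompt; the function `react_loop`, the `Add` tool, and `toy_model` (which stands in for a real model call) are hypothetical.

```python
import re
from typing import Callable, Dict

ACTION_RE = re.compile(r"ACTION:\s*(\w+)\[(.*?)\]", re.DOTALL)
FINAL_RE = re.compile(r"FINAL_ANSWER\[(.*)\]", re.DOTALL)


def react_loop(
    model: Callable[[str], str],
    tools: Dict[str, Callable[[str], str]],
    goal: str,
    max_turns: int = 8,
) -> str:
    """Alternate model turns and tool observations until the exit gate fires."""
    transcript = [f"GOAL: {goal}"]
    for _ in range(max_turns):
        reply = model("\n".join(transcript))
        transcript.append(reply)
        final = FINAL_RE.search(reply)
        if final:
            return final.group(1)  # exit gate
        action = ACTION_RE.search(reply)
        if action is None:
            observation = "ERROR: no valid ACTION; use ToolName[parameter]."
        elif action.group(1) not in tools:
            observation = f"ERROR: unknown tool {action.group(1)}"
        else:
            observation = tools[action.group(1)](action.group(2))
        # The observation becomes part of the next prompt.
        transcript.append(f"OBSERVATION: {observation}")
    raise RuntimeError("loop limit reached before FINAL_ANSWER")


def toy_model(context: str) -> str:
    """Stand-in for an LLM: acts first, then answers from the last observation."""
    if "OBSERVATION:" not in context:
        return "THOUGHT: I need the sum first.\nACTION: Add[2,3]"
    last = context.rsplit("OBSERVATION: ", 1)[1].strip()
    return f"THOUGHT: The tool returned {last}.\nFINAL_ANSWER[{last}]"


TOOLS = {"Add": lambda arg: str(sum(int(x) for x in arg.split(",")))}
answer = react_loop(toy_model, TOOLS, "What is 2 + 3?")
```

Note that malformed or unknown tool calls are returned to the model as observations rather than raised as exceptions. That is one design choice among several, and it gives the model a chance to correct itself within the loop budget.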

3.2 Execution trace: why observations change the path

The architectural significance of ReAct is that each observation can redirect the agent's entire trajectory. A linear chain cannot do this. Consider a six-turn execution:

Turn 1. The model checks short interest on a stock. The harness returns: 22% short float. This is high.
Turn 2. The model decides to check options volume for a gamma squeeze signal. The harness returns: massive out-of-the-money call spike.
Turn 3. The model checks financial news for a corporate catalyst. The harness returns: nothing found.
Turn 4. No corporate catalyst means this is momentum-driven. The model pivots to retail sentiment. The harness scrapes social media and returns: coordinated retail buying campaign detected.
Turn 5. Risk confirmed. The model deploys a protective hedge. The harness executes the order.
Turn 6. Objective met. The model issues its final answer with a summary.

If Turn 3 had returned a press release (an acquisition, a regulatory filing), the agent would have bypassed social scraping entirely and moved toward corporate valuation. The observation altered the logical trajectory.

The same dynamic shows up in everyday engineering work. Take the goal "fix the failing CI build in this repo":

Turn 1. The model runs npm test. The harness returns: three failing tests, all in auth.test.js, all timing out at 5 seconds.
Turn 2. The model reads auth.test.js. The harness returns: the tests await a JWT verification call that should resolve in milliseconds.
Turn 3. The model reads auth.js. The harness returns: the verification path calls Redis to check a token revocation list.
Turn 4. The model checks whether Redis is reachable from the test environment. The harness returns: connection refused on port 6379.
Turn 5. The model reads docker-compose.test.yml. The harness returns: Redis is mapped to port 6380 in the test compose, not 6379.
Turn 6. The model patches the test config to point at 6380 and reruns the suite. The harness returns: all pass. Objective met.

If Turn 4 had returned a successful Redis connection, the agent would have pivoted to JWT signing keys or token format mismatches. Either way, the observation altered the trajectory. This is the difference between an agent and a script.

4. Writing loops, not prompts

The modern paradigm, pioneered by tools like Claude Code, abandons the pursuit of the "perfect single prompt." Engineering effort has moved into writing deterministic verification loops. When developers say they "just write loops now," they mean they have shifted from acting as prompt authors to acting as supervisors of automated systems.

Harness engineering has become its own discipline. A 2026 data point: LangChain's coding agent gained about 13.7 points on Terminal-Bench from harness changes alone, with no model upgrade. That gap is roughly the size of an entire model generation.

The verification loop. The agent writes code, the harness runs it, catches errors, feeds them back, and loops until tests pass. The human defines the validation criteria, not the code.

4.1 What Claude Code does under the hood

Claude Code operates inside a developer's repository via a CLI. Instead of a human copying code, running it, seeing an error, and pasting the error back to the AI, the harness handles the entire cycle natively. It runs the software, captures the non-zero exit code and the error output, feeds the exact stack trace back into the model, and instructs it to fix the problem. Because every attempt is checked against real execution, this loop tends to produce working code more reliably than a single static prompt.
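The general shape of such a run, capture, and retry cycle can be sketched as follows. This is an illustration of the pattern only and makes no claim about Claude Code's internals: `verify_loop`, `propose_fix`, and the toy fixer (which stands in for the model) are hypothetical names.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable, List


def verify_loop(
    command: List[str],
    propose_fix: Callable[[str], None],
    max_iterations: int = 5,
) -> int:
    """Run the command until it exits 0; return how many runs it took."""
    for attempt in range(1, max_iterations + 1):
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode == 0:
            return attempt
        # Feed the captured error output back to the fixer (the model's role).
        propose_fix(result.stderr)
    raise RuntimeError("still failing after max_iterations")


def demo() -> int:
    """Start with a script that crashes and let a toy fixer repair it."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "task.py"
        script.write_text("print(1 / 0)\n")

        def toy_fixer(error_output: str) -> None:
            # A real harness would send error_output to the model here.
            if "ZeroDivisionError" in error_output:
                script.write_text("print(1 / 1)\n")

        return verify_loop([sys.executable, str(script)], toy_fixer)
```

Calling `demo()` runs the script twice: the first run fails with a traceback, the fixer rewrites the file, and the second run exits cleanly.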

4.2 The role shift

The engineer's job changes. Instead of authoring the code, the human writes the validation rules: "The task is complete when npm test yields zero errors and the security audit passes." The engineer can start multiple independent terminal loops, define the constraints, and let the agent harness work through hundreds of iterations, self-correcting compile errors along the way. The human is the constraint manager. The harness does the throughput.
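One way to picture the human's side of this arrangement is a declarative list of completion checks, each a command that must exit with code zero. The sketch below is hypothetical: `failed_checks` and `DEMO_CHECKS` are illustrative names, and the demo commands are trivial Python one-liners so the example runs anywhere. In a Node.js repository the entries might instead be commands such as `npm test` and `npm audit`.

```python
import subprocess
import sys
from typing import Dict, List


def failed_checks(checks: Dict[str, List[str]]) -> List[str]:
    """Run every check and return the names of those that exit non-zero."""
    failed = []
    for name, command in checks.items():
        result = subprocess.run(command, capture_output=True)
        if result.returncode != 0:
            failed.append(name)
    return failed


# Stand-in commands: "tests" passes, "audit" fails.
DEMO_CHECKS = {
    "tests": [sys.executable, "-c", "raise SystemExit(0)"],
    "audit": [sys.executable, "-c", "raise SystemExit(1)"],
}

remaining = failed_checks(DEMO_CHECKS)
task_complete = not remaining
```

The agent loop keeps iterating while `remaining` is non-empty. The engineer edits only the dictionary of checks, never the code the agent writes.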