Everything about AI Agents

0. What an agent actually is

0.1 The minimum viable definition

An agent is an LLM in a loop, with tools and memory. The model reasons about what to do next. Tools let it act on something outside itself: read a file, call an API, run code, send a message. Memory lets it carry context from one step to the next. The loop keeps going until the goal is met, the model gives up, or a human stops it.

That's it. Everything else (frameworks, sandboxes, eval harnesses, multi-agent orchestration, computer use) is about making this loop more reliable, more capable, or more governable. The core is small.

The mental shift from a chatbot to an agent is going from one-turn-and-done to many-turns-toward-a-goal, where the model decides the path. A chatbot answers "What's my account balance?" An agent, given "Reconcile last quarter's vendor invoices and flag anything off," figures out the steps, executes them, handles mid-flight errors, and reports back. The human isn't in the loop on every step.

0.2 The plan-act-observe loop

The cycle, broken out:

Plan. The model reads the current context (the goal, what's happened so far, what tools are available) and decides what to do next. This decision is generated as text: usually a tool call with arguments, sometimes just a natural-language thought, sometimes "the goal is met, here's the answer."
Act. Whatever runs the loop (the harness) executes the model's decision. If the model asked to call read_file("/etc/hosts"), the harness reads the file. If the model asked to send an email, the harness sends it.
Observe. The result of the action is fed back into the model's context. The file contents, the API response, the error message, the screenshot. Whatever the action produced is now part of what the model sees on the next turn.
Repeat. The model sees the observation and plans again. New context, new decision. The loop continues until termination: the goal is met, the model decides to stop, the harness hits a budget cap (token, time, tool-call count), or a human intervenes.

The plan-act-observe loop. Every agent has one; the harness is what runs it.

The reason this loop produces interesting behavior (interesting in the sense that it can solve problems the model couldn't solve in a single shot) is that each turn the model sees more information than it had before. Errors become input. Failed assumptions become input. The model's own previous reasoning becomes input. Over enough turns and against the right kind of problem, the loop converges on something useful.

The reason this loop is also hard to make reliable is that each turn the context grows, the model's attention has to span more of it, and small errors compound. An agent that is 90% reliable per step is about 35% reliable over ten steps. This is why everything in production agent design (context management, error recovery, sandboxing, eval) is fundamentally about defending against compounding failure across the loop.
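The arithmetic behind that figure is short enough to check directly. A minimal sketch, assuming (simplistically) that steps fail independently and that any single failure sinks the whole run; real agents can sometimes recover, so treat this as a pessimistic model rather than a measurement:

```python
def chain_reliability(per_step: float, steps: int) -> float:
    """Probability that every one of `steps` independent steps succeeds."""
    return per_step ** steps

# How quickly end-to-end reliability decays as the loop gets longer.
for n in (1, 5, 10, 20):
    print(n, round(chain_reliability(0.9, n), 3))
```

The same function shows why per-step improvements matter so much: moving from 0.90 to 0.99 per step changes the ten-step figure far more than the per-step difference suggests.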

The piece that runs this loop is called the harness. The harness is the code (or framework, or managed runtime) that calls the model, parses the model's output to extract tool calls, executes those tool calls, formats results, manages context, enforces timeouts, handles errors, and decides when to stop. It is the part that is not the model. Every agent has one, even when the developer doesn't think of it as such; calling the API in a while True loop with a couple of if statements is a (minimal, fragile) harness. Anthropic uses "harness engineering" as a term of art for the discipline of building these well, and the term has spread across the field in 2026.
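To make "harness" concrete, here is a minimal sketch of one. Everything in it is an assumption for illustration: `scripted_model` stands in for a real LLM call, the `add` tool and the decision dictionaries are hypothetical shapes, and no provider's actual API looks exactly like this.

```python
from typing import Callable

# Hypothetical tool registry; a real harness would expose tools with real side effects.
TOOLS: dict[str, Callable[..., str]] = {
    "add": lambda a, b: str(a + b),
}

def scripted_model(history: list[dict]) -> dict:
    """Stand-in for an LLM call: requests one tool call, then answers."""
    observations = [m for m in history if m["role"] == "tool"]
    if not observations:
        return {"type": "tool_call", "name": "add", "args": {"a": 2, "b": 3}}
    return {"type": "final", "text": f"The sum is {observations[-1]['content']}."}

def run_agent(goal: str, model=scripted_model, max_steps: int = 10) -> str:
    history = [{"role": "user", "content": goal}]
    for _ in range(max_steps):                       # budget cap
        decision = model(history)                    # plan
        if decision["type"] == "final":
            return decision["text"]
        tool = TOOLS.get(decision["name"])
        if tool is None:                             # hallucinated tool name
            result = f"error: unknown tool {decision['name']!r}"
        else:
            try:
                result = tool(**decision["args"])    # act
            except Exception as exc:                 # errors become observations
                result = f"error: {exc}"
        history.append({"role": "tool", "content": result})  # observe
    return "stopped: step budget exhausted"

print(run_agent("What is 2 + 3?"))
```

Note that errors are appended to the history rather than raised: the model sees them on the next turn, which is what lets the loop recover.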

0.3 Agent vs assistant vs workflow vs chatbot

These four words get used interchangeably in marketing, and it's worth nailing the distinctions because the architecture and the failure modes differ.

Chatbot. Single-turn or multi-turn dialog. No external action, no tools, no persistent memory beyond the current conversation. The model produces text, the human reads it. The risk surface is content quality (hallucinations, bias), not action.

Assistant. Multi-turn dialog where the model has tools and uses them within a conversation, but the human drives the loop. Each turn is the human asking and the model responding (possibly using a tool to do so). ChatGPT with web search and file analysis tools is an assistant by this definition: tools are present, but the human is sending messages and reading replies one at a time. Microsoft 365 Copilot Chat is mostly an assistant in this sense.

Workflow. A predefined sequence of LLM calls, often with branches and conditionals, but the path is structured by the developer up front. Each step is a prompt; each output flows to the next step. Predictable, deterministic-enough, easy to debug. The model is not deciding the path. The developer wrote the path.

Agent. A loop where the model decides what to do next. The path is not predefined. The model can call any tool from its set, in any order, any number of times. The same task can produce different trajectories on different runs. Flexible, but harder to predict and harder to constrain.

The four are a continuum, not a clean taxonomy. A workflow with one model-decided branch is a tiny agent. An agent with a single tool and a tight loop is approaching workflow territory. The line that matters is: who decides what happens next, the developer or the model?

0.4 The autonomy spectrum

Within "agent," there is a spectrum of how much independence the agent actually has. It is worth being precise about, because enterprise governance conversations live entirely on this spectrum.

Four well-known points on the autonomy axis. Enterprise deployments usually start at human-confirmed and migrate toward greater autonomy as trust builds.

Suggestion. The agent proposes; the human accepts or edits before anything runs. GitHub Copilot's inline code completion is the canonical example.

Human-confirmed action. The agent intends to take an action, displays it, and waits for confirmation. Claude Code's "I'm about to run rm -rf node_modules. Confirm?" is this mode. ChatGPT showing "Click Authorize to let this app send messages on your behalf" is the same pattern at OAuth-grant granularity.

Autonomous with oversight. The agent acts without per-action confirmation but the human can see what is happening and intervene. Most production agent deployments live here. Claude Cowork running for an hour on a task, with a visible activity log and a stop button. ChatGPT Operator booking a restaurant reservation while you watch the browser. AgentCore Runtime executing a deep research agent for hours, with traces flowing to observability.

Fully autonomous. The agent runs without any human in the loop. Scheduled tasks. Triggered tasks. Long-running background agents. The accountability question shifts from "did the human approve?" to "did the system the human configured allow this?" which is most of what Section 11 (governance) is about.

The pattern emerging in enterprise deployments: start at human-confirmed, build trust through observability and audit, move to autonomous-with-oversight for well-understood workflows, leave the truly autonomous mode for tasks where the cost of a wrong action is bounded.
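In a harness, the four modes often reduce to a single gate in front of tool execution. The sketch below is one plausible policy, not something the text above prescribes: the `Autonomy` enum, the `needs_confirmation` function, and the choice to still confirm destructive actions under oversight are all assumptions made for illustration.

```python
from enum import Enum

class Autonomy(Enum):
    SUGGESTION = 1
    HUMAN_CONFIRMED = 2
    AUTONOMOUS_WITH_OVERSIGHT = 3
    FULLY_AUTONOMOUS = 4

def needs_confirmation(level: Autonomy, destructive: bool) -> bool:
    """Decide whether the harness must pause for a human before acting.

    Assumed policy: the two lowest levels always pause; under oversight only
    destructive actions pause; fully autonomous runs never pause and rely on
    whatever the configured system allows.
    """
    if level in (Autonomy.SUGGESTION, Autonomy.HUMAN_CONFIRMED):
        return True
    if level is Autonomy.AUTONOMOUS_WITH_OVERSIGHT:
        return destructive
    return False

print(needs_confirmation(Autonomy.AUTONOMOUS_WITH_OVERSIGHT, destructive=True))
```

Keeping the level as configuration rather than code makes the "start at human-confirmed, loosen later" migration a settings change instead of a rewrite.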

0.5 Why 2025-2026 is when agents stopped being theoretical

People have been talking about agentic AI since the earliest days of large language models. AutoGPT in March 2023 was the first viral agent demo. BabyAGI followed weeks later. Both produced impressive demos that did not survive contact with real tasks. Why did the field move from "interesting demo" in 2023 to Microsoft, Anthropic, AWS, and Google shipping enterprise agents at scale in 2026? Three things converged.

Model capability crossed a threshold. The agent loop is brutal on the underlying model. The 2023-era models could do this for two or three turns and then drift. The current generation (Claude Sonnet 4.5+, Claude Opus 4.6+, the GPT-5 series, Gemini 3 Pro) can sustain coherent agent behavior over dozens of turns. Reasoning models specifically (o3, Claude with extended thinking, DeepSeek R1, Phi-4-reasoning) compound this by spending more inference compute per planning decision.

MCP made tool-calling universal. Before MCP, every agent platform had its own way of describing tools. MCP collapsed that fragmentation in roughly fifteen months. The cost of giving an agent access to a new tool went from "implement an integration" to "install an MCP server." (See the MCP dossier for the full picture.)

Managed runtimes arrived. The hard parts of building production agents (sandbox isolation, state persistence, safe code execution, observability, recovery from failure) are infrastructure problems. Anthropic Managed Agents (public beta April 8, 2026), Amazon Bedrock AgentCore (GA October 2025), Microsoft Foundry Agent Service, Google's Vertex AI Agents all ship the harness and sandbox so the developer can focus on agent logic.

The signal that the threshold was genuinely crossed: Andrej Karpathy publicly shifted his stance on agents in late 2025. He had been one of the most prominent skeptics, dismissing early agent attempts as "agentic slop." After spending real time with Claude Code, he changed his mind in public. The other signal is adoption velocity. Roughly 4% of all public GitHub commits are now authored by Claude Code, doubling in a single month. GitHub Copilot reached 26 million users by October 2025, up from 15 million in April. These are not pilot deployments. They are agents producing real code that ships to real production.

0.6 The "start simple" rule

The most useful practitioner advice on agents, articulated explicitly by Anthropic in their "Building Effective Agents" post: don't reach for an agent when a workflow will do, and don't reach for a workflow when a single prompt will do.

Anthropic's exact wording: "Start with simple prompts, optimize them with comprehensive evaluation, and add multi-step agentic systems only when simpler solutions fall short."

The practical version, when deciding whether something needs to be an agent:

If the path is fixed and the model is just doing each step → workflow, not agent.
If the path varies by case but is bounded and explainable → workflow with a small router, not full agent.
If the path genuinely depends on what the model finds along the way, and you can't reasonably enumerate it in advance → agent, and design for the failure modes from day one.

1. The agent loop architecture

1.1 ReAct: the foundational pattern

ReAct (Reasoning + Acting) is the paper that articulated the modern agent loop. Published October 2022 by Shunyu Yao and colleagues at Princeton and Google Research (arXiv:2210.03629), it predates ChatGPT's launch by weeks and is the conceptual ancestor of every production agent today.

The core idea: instead of treating reasoning and acting as separate concerns (chain-of-thought for reasoning, function calling for acting), interleave them. At each step the model produces a thought (free-form reasoning about the current situation), then an action (a tool call), then receives an observation (the tool result), then thinks again. Thought / action / observation, repeating.

The benefit ReAct demonstrated empirically: pure chain-of-thought reasoning, without grounding in external data, drifts into hallucination over multi-step tasks. Pure action generation, without reasoning traces, can't recover from errors or update plans. Interleaving them gives the model the ability to explain to itself what it's doing, why, and what it learned from the last observation, and the explanations meaningfully improve subsequent action choices.

The thought/action/observation pattern is now baked into essentially every agent framework. LangChain agents, the OpenAI Agents SDK, Strands, CrewAI, AutoGen all implement variants of ReAct underneath. When you see an agent's trace showing "Thought: I need to find X. Action: search('X'). Observation: ..." you're looking at ReAct.
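A text-format trace like that one has to be parsed before the harness can act on it. The sketch below shows one way to split a turn into thought and action; the regular expression and the single-string-argument convention are assumptions for illustration. Most current frameworks receive structured tool-call blocks from the API instead of parsing free text.

```python
import re

# Matches lines such as: Action: search('X')
ACTION_RE = re.compile(r"^Action:\s*(\w+)\((.*)\)\s*$")

def parse_react_step(text: str) -> dict:
    """Split one model turn into its thought and its (optional) action."""
    thought, action = [], None
    for line in text.strip().splitlines():
        line = line.strip()
        match = ACTION_RE.match(line)
        if match:
            # Strip surrounding quotes and spaces from the single argument.
            action = {"tool": match.group(1), "arg": match.group(2).strip("'\" ")}
        elif line.startswith("Thought:"):
            thought.append(line[len("Thought:"):].strip())
    return {"thought": " ".join(thought), "action": action}

step = parse_react_step("Thought: I need to find X.\nAction: search('X')")
print(step)
```

A turn with a thought but no action parses to `action=None`, which a harness can treat as the model's final answer.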

1.2 Reflexion: learning from failure within a session

ReAct gives the agent the ability to reason about what to do next given the current state. It does not give the agent the ability to learn from its own failures across multiple attempts. Reflexion (Shinn et al., March 2023, arXiv:2303.11366, NeurIPS 2023) is the pattern for that.

The core idea: when an agent fails at a task, instead of just retrying, the agent reflects on the failure, generates a verbal explanation of what went wrong, and stores that reflection in an episodic memory buffer. On the next attempt, the reflection is included in the context. The agent literally reads its own self-critique from the previous trial before trying again.

What makes this conceptually interesting: it's reinforcement learning without weight updates. The "policy improvement" happens by adding text to the agent's context, not by changing the model's parameters. This is fast (no fine-tuning), cheap (no gradient computation), and works with frozen black-box models.

The paper showed Reflexion improving GPT-4 performance on HumanEval from ~80% to ~91% pass@1 on programming. Production agents use simplified versions of this pattern constantly. "If your last three attempts to call this tool failed, write down what's not working and try a different approach" is Reflexion in spirit, even when nobody calls it that.
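The control flow is small. A minimal sketch, assuming two caller-supplied functions that would be LLM calls in practice: `attempt` runs a trial with the reflections so far in context, and `reflect` turns failure feedback into a verbal self-critique. Both toy implementations below are hypothetical stand-ins, not the paper's code.

```python
def reflexion_loop(attempt, reflect, max_trials: int = 3):
    """Retry `attempt`, feeding verbal reflections from failed trials back in.

    attempt(reflections) -> (success: bool, feedback: str)
    reflect(feedback)    -> str, a self-critique of the failed trial
    Returns (trial number that succeeded or None, reflections gathered).
    """
    reflections: list[str] = []           # episodic memory buffer
    for trial in range(1, max_trials + 1):
        success, feedback = attempt(reflections)
        if success:
            return trial, reflections
        reflections.append(reflect(feedback))
    return None, reflections

# Toy stand-ins: the task only succeeds once a relevant reflection is in context.
def toy_attempt(reflections):
    if any("off-by-one" in r for r in reflections):
        return True, "tests passed"
    return False, "test_range failed: expected 10 items, got 9"

def toy_reflect(feedback):
    return f"Last trial failed ({feedback}); likely an off-by-one in the loop bound."

trial, notes = reflexion_loop(toy_attempt, toy_reflect)
print(trial, notes)
```

No parameters change anywhere in this loop; the only thing that differs between trial one and trial two is the text in `reflections`.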

1.3 Plan-then-execute vs reactive

Two macro-patterns for how the loop is structured:

Reactive (pure ReAct). The agent decides one step at a time. Each turn: look at the current state, think about it, take one action, observe, repeat. No upfront plan beyond the current step. The trajectory emerges as the agent goes.

Plan-then-execute. The agent writes a full plan first (a list of steps, often with dependencies between them), then executes the plan step by step. The plan is generated once at the start; execution may include some adaptation but the structure is fixed early.

Plan-and-revise (hybrid). The agent makes a rough plan, executes some of it, and revises the plan when observations don't match expectations. Most production agents land here in practice, even when their framework calls it something else.

Reactive works best when the path is genuinely unknown in advance. Exploring an unfamiliar codebase. Researching a topic. Reactive trades higher token cost for flexibility.
Plan-then-execute works best when the task has structure the model can identify upfront. Building a feature with a clear spec. Migrating data from one schema to another. Plan-then-execute trades flexibility for predictability and lower token cost.
Plan-and-revise is the practical default. Agents in production rarely commit to either extreme.

1.4 Why the LLM-in-a-loop architecture works

A subtle point worth being explicit about: the agent loop is not stateful in the way most software systems are stateful. The model itself is not remembering anything between turns. Every turn, the entire conversation history (or as much as fits) is fed back into the model from scratch. The model is doing one-step-ahead decision-making, conditioned on the full history.

This sounds limiting and expensive (it is both) but it's also why the approach is so composable. You can swap the underlying model (Claude → GPT-5 → Llama 4) without changing the loop. You can serialize the agent's state to disk between turns and resume later. The model is essentially a stateless function from history to next-decision; the harness manages everything else.
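Because the model is a function of the history it is handed, pausing and resuming an agent is just saving and reloading that history. A minimal sketch, with `next_decision` as a hypothetical stand-in for any model and a JSON file as an assumed storage format:

```python
import json
import os
import tempfile

def next_decision(history: list[dict]) -> dict:
    """Stand-in for any model: a pure function of the history it receives."""
    return {"role": "assistant", "content": f"turn {len(history)}"}

def save_state(history: list[dict], path: str) -> None:
    with open(path, "w", encoding="utf-8") as f:
        json.dump(history, f)

def load_state(path: str) -> list[dict]:
    with open(path, encoding="utf-8") as f:
        return json.load(f)

history = [{"role": "user", "content": "Summarize the repo."}]
history.append(next_decision(history))

# Persist between turns, then resume (possibly in another process, or with another model).
path = os.path.join(tempfile.mkdtemp(), "agent_state.json")
save_state(history, path)
resumed = load_state(path)
resumed.append(next_decision(resumed))
print(resumed[-1])
```

Swapping the model means swapping `next_decision`; nothing about the saved state needs to change.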

1.5 Where the loop stalls

The loop has predictable failure modes. Naming them is useful both for debugging and for designing harnesses that defend against them.

Compounding errors. An agent that's 90% reliable per step is about 35% reliable over ten steps (0.9^10 ≈ 0.349). Real-world agents often work cleanly for 3-5 steps and then go off the rails.
Context window saturation. As the loop runs, history accumulates. Tool outputs can be large. Eventually the context fills up, and either the harness has to summarize/truncate (lossy), the model's attention degrades on buried-but-relevant parts of the history, or the agent literally hits the context limit and the call fails.
Looping on the same failed approach. The agent tries something, fails, retries with minor variation, fails, retries again. Without a Reflexion-style mechanism to step back and reconsider strategy, the agent can spend dozens of turns rehashing a doomed approach.
Hallucinated tool calls. The model invents a tool that doesn't exist, or calls a real tool with wrong argument types or impossible argument values.
"Lost in the middle." Documented by Liu et al. (2023, arXiv:2307.03172), models attend better to information near the start and end of long contexts, and worse to information buried in the middle.
Goal drift. Over enough turns, the agent gradually loses track of the original goal and starts optimizing for something adjacent. The harness can defend against this by re-injecting the original goal periodically.
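Two of these failure modes have cheap harness-side defenses. The sketch below shows a repeated-call detector for looping and a periodic goal reminder for drift. The function names, the window and threshold values, and the reminder wording are all assumptions chosen for illustration.

```python
import json
from collections import Counter

def is_stuck(tool_calls: list[dict], window: int = 6, threshold: int = 3) -> bool:
    """True if any identical call appears `threshold`+ times in the last `window` calls."""
    recent = tool_calls[-window:]
    # Serialize with sorted keys so argument order does not hide a repeat.
    keys = [json.dumps(call, sort_keys=True) for call in recent]
    return any(count >= threshold for count in Counter(keys).values())

def with_goal_reminder(history: list[str], goal: str, every: int = 5) -> list[str]:
    """Re-inject the original goal whenever the history length is a multiple of `every`."""
    if history and len(history) % every == 0:
        return history + [f"Reminder of the original goal: {goal}"]
    return history

calls = [{"name": "read_file", "args": {"path": "a.txt"}}] * 3
print(is_stuck(calls))
```

When `is_stuck` fires, a harness might inject a prompt asking the model to reconsider its strategy, which is the Reflexion idea applied mid-run.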

1.6 Convergence and termination

When does the loop stop? In practice, three patterns:

Model-initiated termination. The model decides it's done: either implicitly, by producing a final response without a tool call, or explicitly, by invoking a "submit answer" / "done" tool. The harness sees the absence of a tool call (or the done call) and exits the loop. This is the cleanest case.

Budget cap termination. The harness enforces a maximum: maximum tool calls, maximum tokens, maximum wall-clock time. When the cap hits, the loop ends regardless of whether the agent thinks it's done. Every production agent framework has this. The default is usually 10-50 iterations; deep research agents might allow hundreds or thousands.

Human termination. The user pulls the plug. Required for any agent running autonomously over a long horizon. The "stop" button in Cowork, the kill signal in Claude Code, the abort mechanism in AgentCore.

Convergence (the agent actually achieving the goal) is harder than termination. Termination just means the loop ended. The loop could have terminated because the agent finished, because it gave up, because it hit a cap, or because the user stopped it. Distinguishing these requires the eval layer (Section 8).
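Recording why the loop ended is the first step toward telling those cases apart. A minimal sketch, assuming a hypothetical `model_step` callable that returns True when the model issued a tool call and False when it produced a final answer; the `StopReason` and `Budget` names and the default caps are illustrative, not taken from any framework.

```python
import time
from dataclasses import dataclass, field
from enum import Enum

class StopReason(Enum):
    MODEL_DONE = "model_done"
    BUDGET = "budget"
    HUMAN = "human"

@dataclass
class Budget:
    max_tool_calls: int = 25
    max_seconds: float = 300.0
    started: float = field(default_factory=time.monotonic)
    tool_calls: int = 0

    def exhausted(self) -> bool:
        return (self.tool_calls >= self.max_tool_calls
                or time.monotonic() - self.started >= self.max_seconds)

def run(model_step, budget: Budget, stop_requested=lambda: False) -> StopReason:
    """Drive the loop and report which of the three patterns ended it."""
    while True:
        if stop_requested():
            return StopReason.HUMAN
        if budget.exhausted():
            return StopReason.BUDGET
        if not model_step():
            return StopReason.MODEL_DONE
        budget.tool_calls += 1

print(run(lambda: True, Budget(max_tool_calls=3)))
```

`StopReason.MODEL_DONE` still only says the model believed it was finished; whether the goal was actually met remains a question for evaluation.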

2. Tool use and connectivity

Tools are how the agent does anything other than produce text. Without tools an LLM is a clever conversation partner; with tools, it's a system that can read files, query databases, send emails, run code, and chain those actions into outcomes. Most of an agent's apparent intelligence in production comes from the quality of its tools and the discipline of how the harness presents them to the model.

2.1 Function calling fundamentals

The basic flow when a model uses a tool:

The harness sends the model a request containing the system prompt, the user's request, and a list of available tools, each described by a name, a description, and a JSON Schema for its inputs (and, in some protocols, its outputs).
The model produces a response. If it decides a tool is needed, the response includes a structured tool-call block: tool name, arguments serialized as JSON, optionally free-form reasoning.
The harness parses the tool-call block, validates the arguments against the schema, executes the tool, and serializes the result.
The harness sends the result back to the model as part of the conversation, and the model decides what to do next: call another tool, or respond to the user.

OpenAI shipped function calling as an API feature in June 2023. Anthropic followed with tool use in May 2024. Google added function calling to Gemini around the same time. Today every major model provider supports the same general pattern.

Strict mode has emerged as the modern best practice: the harness validates the tool-call output against the schema before executing, and rejects malformed calls. OpenAI's strict mode (August 2024) and Anthropic's tool input validation enforce schema compliance at the API level. Most production agents now use strict mode by default.
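The validate-before-execute step can be sketched without any provider SDK. The tool definition below uses the common name/description/schema shape, but `READ_FILE_TOOL` and `validate_args` are hypothetical, and the validator covers only a tiny subset of JSON Schema (required keys, unknown keys, primitive types). A production harness would more likely use a full JSON Schema library or the provider's own strict mode.

```python
import json

# Hypothetical tool definition, for illustration only.
READ_FILE_TOOL = {
    "name": "read_file",
    "description": "Read a UTF-8 text file and return its contents.",
    "input_schema": {
        "type": "object",
        "properties": {"path": {"type": "string"}},
        "required": ["path"],
        "additionalProperties": False,
    },
}

_PY_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}

def validate_args(schema: dict, args: dict) -> list[str]:
    """Return a list of problems; an empty list means the call may be executed."""
    errors = []
    props = schema.get("properties", {})
    for key in schema.get("required", []):
        if key not in args:
            errors.append(f"missing required argument: {key}")
    for key, value in args.items():
        if key not in props:
            if schema.get("additionalProperties") is False:
                errors.append(f"unexpected argument: {key}")
            continue
        expected = _PY_TYPES.get(props[key].get("type"))
        if expected and not isinstance(value, expected):
            errors.append(f"{key}: expected {props[key]['type']}")
    return errors

raw = '{"path": "/etc/hosts"}'   # arguments as the model serialized them
print(validate_args(READ_FILE_TOOL["input_schema"], json.loads(raw)))
print(validate_args(READ_FILE_TOOL["input_schema"], {"file": 42}))
```

Returning the error list to the model as the tool result, rather than raising, gives it a chance to correct the call on the next turn.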

2.2 The MCP shift

Before MCP, every agent platform had its own tool definition format. OpenAI assistants used one shape, Anthropic tool use another, LangChain wrapped both with its own abstraction, every framework had its own connector to common services. Adding a new tool meant writing the integration code once per framework you cared about: the M×N problem.

MCP collapsed this. A tool becomes an MCP server (running as a separate process or HTTPS endpoint), and any MCP-compatible host can connect to it. Add a tool once, expose it through MCP, and every agent platform can use it. (For the full picture, see the MCP dossier.)

The architectural consequence inside the agent: tools are no longer baked into the harness. They are discovered at startup (via MCP tools/list), can change between sessions, and can be added or removed without touching agent code.
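Discovery at startup looks roughly like the following. The request is the JSON-RPC 2.0 call named above; the response is hand-written for illustration and shows only the fields used here, so check the MCP specification for the full shape. `build_registry` is a hypothetical helper, and no transport (stdio or HTTP) is shown.

```python
import json

# The JSON-RPC 2.0 request a client sends to list a server's tools.
list_request = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}

# Hypothetical response, hand-written for this sketch.
list_response = json.loads("""
{
  "jsonrpc": "2.0",
  "id": 1,
  "result": {
    "tools": [
      {"name": "read_file",
       "description": "Read a text file.",
       "inputSchema": {"type": "object",
                       "properties": {"path": {"type": "string"}},
                       "required": ["path"]},
       "annotations": {"readOnlyHint": true}}
    ]
  }
}
""")

def build_registry(response: dict) -> dict[str, dict]:
    """Index the advertised tools by name so the harness can look them up per call."""
    return {tool["name"]: tool for tool in response["result"]["tools"]}

registry = build_registry(list_response)
print(json.dumps(list_request), sorted(registry))
```

Because the registry is built from the server's answer, adding or removing a tool on the server changes what the agent sees on its next session without any change to this code.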

2.3 Tool annotations

Tool annotations are the metadata that tells the harness what a tool does in terms of side effects. The current MCP spec defines four standard annotations, all hints rather than enforcement:

readOnlyHint: true: the tool does not modify any state. Safe to call freely, including for retries.
destructiveHint: true: the tool may delete, overwrite, or otherwise irrecoverably modify state. Hosts should require explicit user confirmation by default.
idempotentHint: true: calling the tool repeatedly with the same arguments has no additional effect beyond the first call. Safe to retry on transient failures.
openWorldHint: true: the tool reaches outside the local environment. Relevant for security policies that distinguish local-only operations from network-touching ones.

OpenAI's Apps SDK requires these annotations on every tool. Anthropic's Claude Connectors use them. Microsoft Security Copilot blocks tools with destructiveHint: true from import as a security posture choice.
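A harness typically translates these hints into behavior: whether to ask first, whether to retry automatically, whether network policy applies. The mapping below is a sketch under stated assumptions. `call_policy` and its output keys are hypothetical, and treating a missing hint as the riskier value is this example's choice. Since the annotations are hints supplied by the server, they should not be trusted from a server that has not been vetted.

```python
def call_policy(annotations: dict) -> dict:
    """Map MCP-style tool hints to harness behavior.

    Missing hints are read conservatively in this sketch: the tool is assumed
    to write, to be potentially destructive, to be non-idempotent, and to
    reach outside the local environment.
    """
    read_only = annotations.get("readOnlyHint", False)
    # A read-only tool cannot be destructive, whatever the other hint says.
    destructive = annotations.get("destructiveHint", True) and not read_only
    idempotent = annotations.get("idempotentHint", False)
    open_world = annotations.get("openWorldHint", True)
    return {
        "confirm_first": destructive,
        "auto_retry": read_only or idempotent,
        "needs_network_policy": open_world,
    }

print(call_policy({"readOnlyHint": True}))
print(call_policy({}))
```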

2.4 Tool selection and the routing problem

When an agent has 5 tools, the model can keep them all in mind and choose well. When it has 50, performance degrades. When it has 500, the agent falls apart entirely. This is one of the most active engineering problems in agent design today. AWS coined "MCP context overload" as the term for it. The mitigations split into a few patterns:

Tool subsetting: load only the tools relevant to the current task.
Tool search / dynamic loading: the model has access to a meta-tool that lets it search for and load other tools as needed. Anthropic's Claude Code introduced this with ENABLE_TOOL_SEARCH=1.
Hierarchical bundles: AWS "Powers" and Anthropic "Plugins" bundle related tools, skills, and documentation together.
Routing agents: a small, fast model classifies the request and decides which tools to expose to the main agent.

The 10/100/1000 rule of thumb that has emerged: under 10 tools, no special handling needed; 10-100, use clear naming and good descriptions; 100+, you need an active routing strategy or your agent quality drops noticeably.
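Tool subsetting, the simplest of these mitigations, can be sketched as a ranking step in front of the model. The example below scores tools by word overlap between the request and each tool's name and description. The catalog, the stopword list, and the scoring are all assumptions for illustration; real systems tend to use embeddings or a small routing model instead of keyword overlap.

```python
import re

STOPWORDS = {"a", "an", "the", "in", "to", "for", "of", "our"}

def tokenize(text: str) -> set[str]:
    """Lowercase words and digits, minus a few stopwords; underscores split names."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS

def select_tools(request: str, tools: list[dict], k: int = 2) -> list[str]:
    """Return up to k tool names, ranked by overlap with the request."""
    words = tokenize(request)
    scored = []
    for tool in tools:
        overlap = len(words & tokenize(tool["name"] + " " + tool["description"]))
        scored.append((overlap, tool["name"]))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for overlap, name in scored[:k] if overlap > 0]

# Hypothetical catalog.
CATALOG = [
    {"name": "search_knowledge_base",
     "description": "Search internal company documentation for policy questions."},
    {"name": "send_email", "description": "Send an email message to a recipient."},
    {"name": "run_sql", "description": "Run a read-only SQL query against the analytics database."},
]

print(select_tools("Find the travel policy in our documentation", CATALOG))
```

Only the selected tools' definitions would then be placed in the model's context for that turn.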

2.5 Tool descriptions matter more than people expect

The model's tool selection is driven heavily by the description field on each tool, not just the name. A tool named search_knowledge_base with the description "Search internal company documentation for policy questions, technical specs, and product information" will be selected differently than the same tool with the description "Search for things." The model is reading the description as part of its planning step.

This means tool descriptions are a serious authoring task. They should be specific, mention typical use cases, and disambiguate from similar tools. It also means tool descriptions are a security surface: malicious descriptions can attempt to manipulate model behavior (tool poisoning, covered in the MCP dossier section 10).

2.6 Computer use as a special tool category

Computer use (letting the agent see a screen and click on it) is a tool, but a peculiar one. Each observation is a screenshot of the screen, and each action is a mouse click, keystroke, or scroll command. This collapses essentially any GUI application into something an agent can interact with, at the cost of much higher latency, lower reliability, and a much larger sandbox surface than ordinary tool use.

Anthropic shipped Computer Use as a research preview in October 2024 with Claude 3.5 Sonnet. OpenAI's Operator (Pro tier), Google's Project Mariner, and Microsoft's Copilot Studio computer use preview are the equivalent offerings. Phi-4-reasoning-vision-15B adds an open-weight option specifically tuned for GUI grounding. Section 6 covers computer use in depth.

2.7 The three layers: protocol vs framework vs infrastructure

A confusion point that trips up most people entering this space: MCP, agent frameworks (Strands, LangGraph), and agent platforms (AgentCore, Vertex Agent Runtime) are not alternatives to each other. They're three different layers that stack.

Protocol / Framework / Infrastructure: three layers, not three choices.

Layer 1: Protocol (MCP). MCP defines how a client discovers and calls tools on a server. It's a wire protocol. It does not decide which tool to call (that's the model), does not reason about what to do next (that's the framework), does not host or scale anything (that's the infrastructure).

Layer 2: Agent framework. The framework adds multi-step reasoning loops, planning, memory across steps, multi-agent coordination, error handling and retry logic. Without a framework, you have single request-response. With a framework, the agent manages a loop. The framework sits above MCP; it uses MCP tools, it is not an alternative to MCP.

Layer 3: Infrastructure. Where the code runs. Provides serverless hosting, auth management, monitoring, quality testing, persistent memory. Doesn't care which framework you use.

When do you need a framework vs just an MCP server? An MCP server is enough when the interaction is simple request-response. The model itself can also chain multiple tool calls within a conversation, covering many multi-step cases without a separate framework. A dedicated agent framework adds value when the task requires complex conditional logic, multi-system orchestration, retries, or autonomous execution without user input between steps.

3. Memory in agents

3.1 Four memory types worth distinguishing

Borrowed loosely from cognitive science but adapted to how LLM systems actually work:

Four memory types: working, episodic, semantic, and procedural. Useful as a vocabulary.

These are not always cleanly separated in implementations. Letta's three-tier architecture maps roughly to working (core), episodic (recall), and semantic (archival). OpenAI's memory feature is mostly episodic. Anthropic's Skills are essentially procedural memory. AgentCore Memory exposes all four as separate primitives that agents can read from and write to.

3.2 Working memory and the context window

The starting point: every model has a maximum context window. As of May 2026, the leading numbers:

Claude Opus 4.7, Opus 4.6, Sonnet 4.6: 1M tokens
Earlier Claude models (Opus 4.5, Sonnet 4.5, Haiku 4.5): 200K tokens
GPT-5 series: 400K tokens (varies by tier)
Gemini 3 Pro: 2M tokens
Phi-4-reasoning: 32K tokens
Open-weight Llama 4: up to 10M tokens with the longest variants

Bigger context windows reduce the pressure on memory architectures. The practical caveats: cost grows linearly with input tokens, latency grows roughly linearly with input tokens too, and attention quality is not uniform across the window ("Lost in the Middle"). These are why "just use a 1M context window" is not the universal solution it sometimes gets pitched as.

3.3 RAG vs context window

The classic question: when do you put information in the context window, and when do you retrieve it via RAG?

Context window is right when the information is small enough to fit comfortably (under 20% of the window), all of it is potentially relevant to the current task, and the task is a single coherent reasoning step over the whole content.

RAG is right when the information is much larger than the window can hold, only a small subset is relevant per turn, the information is updated frequently, or the information has structure that benefits from explicit retrieval.

The 2024 wisdom was "RAG everything." The 2025 wisdom became "long context for everything you can fit." The 2026 wisdom is the boring middle ground: hybrid, with the choice driven by the specific access pattern.

3.4 The Letta / MemGPT pattern

The most influential architecture for agent memory comes from the MemGPT paper (Packer et al., UC Berkeley, October 2023, arXiv:2310.08560). The framing: treat the LLM's context window like RAM and external storage like disk, with the agent itself responsible for moving information between tiers. Three tiers:

Core memory: small, always in-context. Functions like RAM. Holds the agent's persona, the user's persona, and a few key facts. The model can read and edit core memory directly via tool calls.
Recall memory: conversation history, searchable on demand. The agent calls a search tool to retrieve past messages by query.
Archival memory: external vector store, queried explicitly. The agent calls archival_memory_search to retrieve, archival_memory_insert to write.

The key innovation isn't the tiers themselves; it's the self-editing memory: the agent has tools that let it modify its own core memory. The Letta pattern shows up implicitly in many other systems (AgentCore Memory's tier separation, ChatGPT's "memories" feature, the M365 Copilot Work IQ context layer).
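The three tiers and the self-editing idea fit in a small class. This is a toy sketch in the spirit of the pattern, not Letta's implementation: substring matching stands in for vector search, and `core_memory_replace`, `conversation_search`, and `render_context` are names chosen for this example (only the two archival tool names come from the text above).

```python
class TieredMemory:
    """Toy three-tier memory: core (in context), recall (log), archival (store)."""

    def __init__(self, persona: str):
        self.core = {"persona": persona, "human": ""}   # always in context
        self.recall: list[str] = []                       # full message log
        self.archival: list[str] = []                     # long-term store

    # Tools the agent itself can call:
    def core_memory_replace(self, section: str, text: str) -> None:
        self.core[section] = text                         # self-editing memory

    def conversation_search(self, query: str) -> list[str]:
        return [m for m in self.recall if query.lower() in m.lower()]

    def archival_memory_insert(self, text: str) -> None:
        self.archival.append(text)

    def archival_memory_search(self, query: str) -> list[str]:
        return [m for m in self.archival if query.lower() in m.lower()]

    def render_context(self, last_n: int = 2) -> str:
        """Only core memory plus the most recent messages enter the prompt."""
        core = "\n".join(f"[{k}] {v}" for k, v in self.core.items())
        return core + "\n" + "\n".join(self.recall[-last_n:])

mem = TieredMemory("Helpful assistant")
mem.recall += ["user: I moved to Lisbon", "agent: Noted", "user: What's the weather?"]
mem.core_memory_replace("human", "Lives in Lisbon")
mem.archival_memory_insert("2023 trip notes: Porto, Lisbon")
print(mem.render_context())
```

The old message about Lisbon has scrolled out of the rendered context, but the fact survives because the agent wrote it into core memory; the original wording is still reachable through `conversation_search`.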

3.5 Prompt caching as memory primitive

Prompt caching dramatically changes how agents handle repeated context. The idea: the model provider stores a snapshot of the model's internal computation state for a given prompt prefix. When a subsequent request shares that prefix, the model reuses the cached state instead of recomputing it. Cost on the cached portion drops by roughly 50-90%, depending on the provider; latency drops as well.

The current state across providers:

Anthropic: explicit cache control via cache_control parameter; 5-minute or 1-hour TTL options. Cache reads cost 0.1× the base input token price. Workspace-level isolation as of February 5, 2026.
OpenAI: automatic for prompts ≥ 1,024 tokens. Cache hits cost 50% of the base input token price.
Google Vertex AI: explicit caching with developer-set TTL.
DeepSeek: automatic prompt caching.

For agents, prompt caching matters because the loop produces highly repetitive prompts: same system prompt every turn, same tool definitions every turn, growing-but-prefix-stable conversation history. Real-world savings: 70-90% of input token costs eliminated for prefix-heavy workloads. Combined with batch API discounts (50% on both Anthropic and OpenAI), stacked savings reach 95% off input costs at Anthropic.
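The 95% figure follows from multiplying the two discounts: a 0.1× cache-read price times a 50% batch discount is 0.05× the base price. The sketch below works that through, assuming the discounts stack multiplicatively (which is what the figure implies) and ignoring any cache-write surcharge. `input_cost` and the price of 1.0 per million tokens are placeholders, not real pricing.

```python
def input_cost(tokens: int, cached_fraction: float, cache_read_multiplier: float,
               batch_discount: float = 0.0, base_price_per_mtok: float = 1.0) -> float:
    """Input cost when `cached_fraction` of the tokens are billed as cache reads.

    Assumes the batch discount multiplies with the cache-read price and that
    cache writes cost nothing extra; both are simplifications.
    """
    cached = tokens * cached_fraction * cache_read_multiplier
    fresh = tokens * (1 - cached_fraction)
    return (cached + fresh) * (1 - batch_discount) * base_price_per_mtok / 1_000_000

baseline = input_cost(1_000_000, cached_fraction=0.0, cache_read_multiplier=1.0)
fully_cached_batch = input_cost(1_000_000, 1.0, 0.1, batch_discount=0.5)
mostly_cached = input_cost(1_000_000, 0.8, 0.1)

print(f"savings, fully cached + batch: {1 - fully_cached_batch / baseline:.0%}")
print(f"savings, 80% cached, no batch: {1 - mostly_cached / baseline:.0%}")
```

The second line is the more realistic case: the savings depend on how much of each request is actually a cache hit, which is why prefix-heavy workloads benefit most.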

The catch: cache hits require exact prefix matches. Inserting the current timestamp at the top of the prompt breaks caching for the entire request. Production agents that want to use caching well design their prompts with a stable prefix and a dynamic suffix.
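The stable-prefix, dynamic-suffix layout can be illustrated at the string level. This is a simplification: providers match on tokens or marked blocks rather than raw strings, and `SYSTEM_PROMPT`, `TOOL_DEFS`, and `build_prompt` are hypothetical names for this sketch.

```python
from datetime import datetime, timezone

SYSTEM_PROMPT = "You are a careful invoice-reconciliation agent."   # hypothetical
TOOL_DEFS = '[{"name": "read_file"}, {"name": "query_ledger"}]'     # hypothetical

def build_prompt(history: list[str], now: datetime) -> tuple[str, str]:
    """Return (stable_prefix, dynamic_suffix).

    Content that is identical from turn to turn goes first; volatile values
    such as the timestamp go last so they cannot invalidate the prefix.
    """
    prefix = "\n".join([SYSTEM_PROMPT, TOOL_DEFS, *history])
    suffix = f"Current time: {now.isoformat()}"
    return prefix, suffix

t1 = datetime(2026, 5, 1, 9, 0, tzinfo=timezone.utc)
t2 = datetime(2026, 5, 1, 9, 1, tzinfo=timezone.utc)
prefix1, suffix1 = build_prompt(["user: start"], t1)
prefix2, suffix2 = build_prompt(["user: start", "tool: ok"], t2)

# Turn 2's prefix extends turn 1's, so turn 1's cached prefix is reusable.
print(prefix2.startswith(prefix1))
# Putting the timestamp first would make the two prompts diverge almost immediately.
print((suffix2 + "\n" + prefix2).startswith(suffix1 + "\n" + prefix1))
```

The same reasoning applies to anything else that varies per request, such as request IDs or randomized tool ordering: keep it out of the prefix.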

3.6 Agent memory as a vendor surface

By 2026, agent memory has become a first-class product feature at multiple vendors:

Anthropic Claude has a memory feature that lets Claude remember user preferences and facts across conversations. Configurable per-conversation, auditable, deletable per-fact.
OpenAI ChatGPT has had memory since 2024, expanded significantly in 2025-2026. Available across Plus, Pro, Team, and Enterprise plans.
Microsoft Copilot has memory integrated into the M365 Work IQ layer, grounded in the user's Microsoft Graph.
AWS AgentCore Memory is a separate primitive in the AgentCore platform that agents read from and write to via API.
Google has memory in Gemini Apps and integrates personal context through Gemini Enterprise's grounding to Workspace.

The lesson from a few years of these features in production: memory is the single feature users notice most when it works, and the single feature they're most spooked by when it goes wrong. Every vendor that has shipped memory has had to invest heavily in transparency (here's what I remember, here's where I learned it) and control (forget this, never remember that).

3.7 The raw vs derived spectrum

Every memory system is choosing a position on the raw vs derived spectrum.

Raw memory stores everything verbatim. The full conversation, every email, every document, with timestamps. Retrieval is exact. The system never loses fidelity but storage and retrieval get expensive at scale.

Derived memory stores summaries, abstractions, embeddings. The system extracts the gist and discards the raw. Retrieval is fast and cheap. The system loses the long tail of detail and the abstraction's quality depends entirely on how well the summarization step worked.

Production memory systems hybridize: keep the raw for some window, derive after that, allow the agent to ask for raw retrieval when the derived isn't enough.
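One way to picture the hybrid is the sketch below. It is a toy under stated assumptions: `HybridMemory` and its methods are hypothetical, the window is counted in items rather than time, and `summarize` is a placeholder for whatever derivation step (typically a model call) a real system would use.

```python
from dataclasses import dataclass, field


@dataclass
class HybridMemory:
    raw_window: int = 3                                     # recent items kept verbatim
    raw: list[str] = field(default_factory=list)
    derived: list[str] = field(default_factory=list)
    archive: dict[int, str] = field(default_factory=dict)   # cold raw storage
    _next_id: int = 0

    def summarize(self, text: str) -> str:
        # Placeholder for a real summarization step.
        return text[:20]

    def add(self, text: str) -> None:
        self.raw.append(text)
        # Items that age out of the window are archived and replaced by a summary.
        while len(self.raw) > self.raw_window:
            oldest = self.raw.pop(0)
            self.archive[self._next_id] = oldest
            self.derived.append(f"[{self._next_id}] {self.summarize(oldest)}")
            self._next_id += 1

    def context(self) -> list[str]:
        """What the agent sees by default: summaries first, then recent raw items."""
        return self.derived + self.raw

    def fetch_raw(self, item_id: int) -> str:
        """Escape hatch for when the derived form isn't enough."""
        return self.archive[item_id]
```

The bracketed id in each summary is what lets the agent ask for the original text later.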

4. Planning and reasoning

4.1 Chain-of-thought as the entry point

Chain-of-thought (CoT) prompting was introduced by Wei et al. (Google Research, 2022, arXiv:2201.11903). The core observation: getting the model to write out intermediate reasoning before answering, whether through worked examples in the prompt or, in the later zero-shot variant, by simply telling it to "think step by step," produces dramatically better results on multi-step problems than prompting for the answer directly.

Why it works: the model is generating tokens autoregressively, conditioning each new token on everything that came before. When the model writes out its reasoning explicitly, those reasoning tokens become input for subsequent tokens, including the final answer. Without CoT, the model is trying to derive the answer in a single forward pass with no scratch space.

CoT is the simplest possible reasoning pattern and the one every other technique builds on. In agent contexts, CoT shows up as the "thought" portion of ReAct's thought/action/observation cycle.

4.2 Tree of Thoughts and search-based reasoning

CoT is linear; the model produces one chain of reasoning and follows it. Tree of Thoughts (ToT) generalizes CoT to a tree. Yao et al. (Princeton + Google, May 2023, arXiv:2305.10601) introduced the framework: at each step, generate multiple candidate "thoughts," evaluate them, pick the best ones, expand the tree from there.

The headline result: on the Game of 24 task, GPT-4 with chain-of-thought solved 4% of problems. GPT-4 with Tree of Thoughts solved 74%. Same model, much better problem solving.

The trade-off is cost. ToT requires multiple model calls per reasoning step. On complex tasks the cost can be 10-50× a single CoT call. This is why ToT shows up in research and specialized applications more than in everyday production agents.
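The generate, evaluate, prune cycle and its cost can be sketched as a breadth-first beam search. This is a toy, not the paper's implementation: `propose` and `score` are deterministic stand-ins for what would be model calls, and the puzzle (pick three numbers that sum to a target) is invented for illustration.

```python
from typing import Callable, Sequence

State = tuple[int, ...]


def tree_of_thoughts(root: State,
                     propose: Callable[[State], Sequence[State]],
                     score: Callable[[State], float],
                     beam_width: int, depth: int) -> tuple[State, int]:
    """Expand every frontier state, score the candidates, keep the best few.

    Returns the best final state and the number of propose/score calls,
    each of which would be a model call in a real system.
    """
    frontier = [root]
    calls = 0
    for _ in range(depth):
        candidates: list[State] = []
        for state in frontier:
            candidates.extend(propose(state))
            calls += 1                  # one generation call per expansion
        calls += len(candidates)        # one evaluation call per candidate
        frontier = sorted(candidates, key=score, reverse=True)[:beam_width]
    return frontier[0], calls


TARGET = 19


def propose(state: State) -> list[State]:
    # Toy "thought generator": extend the partial solution with each option.
    return [state + (n,) for n in (1, 5, 9)]


def score(state: State) -> float:
    # Toy "evaluator": closer to the target is better.
    return -abs(TARGET - sum(state))


best, calls = tree_of_thoughts((), propose, score, beam_width=2, depth=3)
```

With a beam width of 2 and three options per step, this three-step search makes 20 calls where a single linear chain would make three generation calls, which is the cost multiplier in miniature.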

4.3 Reasoning models: the test-time compute shift

The biggest reasoning shift in recent AI is the rise of reasoning models: models specifically trained to spend more inference compute on hard problems before producing an answer. The lineage:

OpenAI o1 (preview September 2024, GA December 2024) was the first widely available reasoning model.
OpenAI o3 (December 2024) extended the approach with stronger benchmarks, particularly on ARC-AGI.
DeepSeek R1 (January 2025) was the first open-weight reasoning model at frontier quality, trained primarily with reinforcement learning from verifiable rewards (RLVR). The release shook the industry; frontier-grade reasoning was no longer a closed-vendor capability.
Claude with extended thinking lets Claude allocate explicit "thinking budget" before responding. The thinking is visible to the user (unlike o1's hidden reasoning) and configurable.
Gemini 3 Pro thinking is Google's equivalent capability, integrated into the standard model.
Phi-4-reasoning, Phi-4-reasoning-plus, Phi-4-reasoning-vision-15B (Microsoft Research, 2025-2026) are open-weight reasoning models distilled from o3-mini's reasoning traces.

The conceptual frame: test-time compute as a new scaling axis. Pre-2024, the dominant scaling strategy was bigger pretraining. Reasoning models opened a second axis: pay more inference compute per query for higher-quality answers.

For agents, reasoning models matter because the per-step decision quality compounds. An agent running a 50-step loop with a reasoning model that is 95% reliable per step gets through the whole loop without a bad step only 7.7% of the time (0.95^50, assuming independent steps and no recovery). The same agent with a non-reasoning model at 90% per step ends at 0.5% (0.90^50). The 2026 production pattern: route the model used per step to the kind of decision being made. Cheap fast model for routine tool selection; reasoning model for the hard planning steps.
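The compounding figures can be reproduced in a few lines. The sketch assumes every step must succeed and that steps fail independently; real agents that detect and recover from errors will do better than this bound.

```python
def loop_reliability(per_step: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return per_step ** steps


reasoning = loop_reliability(0.95, 50)       # roughly 0.077
non_reasoning = loop_reliability(0.90, 50)   # roughly 0.005
ratio = reasoning / non_reasoning            # how much the 5-point gap compounds
```

A five-point difference per step becomes roughly a fifteen-fold difference over 50 steps.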

4.4 When explicit planning helps vs hurts

Within the agent loop, there's a question of when to add explicit planning structure beyond what the model does naturally.

Plan-then-execute front-loads planning. Works when the task has structure the model can identify upfront, when avoiding mid-task pivots is important, or when you want a human review step before action. Hurts when the plan is wrong and the agent over-commits.

Reflexion-style self-correction adds planning after failures. Helps for tasks where failure signals are clear and the model can learn from its own critique. Hurts when the failure signal is ambiguous.

Just-in-time planning is the middle ground: the agent thinks one or two steps ahead at most, takes an action, observes, and thinks again. This is what most production agents do, with reasoning models providing the per-step quality boost.

4.5 Plan caching and reuse

When an agent plans well for a task, store the plan and reuse it for similar tasks. This is "plan caching": caching the reasoning artifact, not the prompt. Versions in the wild: Anthropic Skills, AWS Powers, OpenAI's GPT memory storing procedural patterns, LangGraph's state checkpointing.

An agent that always plans from scratch is slower and more error-prone than one that can recognize "I've done something like this before, here's the proven approach."
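A minimal sketch of the idea, assuming plans are lists of step descriptions and that "similar" tasks can be matched by a crude normalized signature. `PlanCache`, `plan_for`, and the normalization rule are all hypothetical; the products named above use their own, richer mechanisms.

```python
import re
from typing import Callable, Optional


class PlanCache:
    """Stores proven plans keyed by a normalized task signature."""

    def __init__(self) -> None:
        self._plans: dict[str, list[str]] = {}

    @staticmethod
    def signature(task: str) -> str:
        # Crude normalization: lowercase, mask numbers, collapse whitespace.
        masked = re.sub(r"\d+", "<n>", task.lower())
        return " ".join(masked.split())

    def lookup(self, task: str) -> Optional[list[str]]:
        return self._plans.get(self.signature(task))

    def store(self, task: str, plan: list[str]) -> None:
        self._plans[self.signature(task)] = plan


def plan_for(task: str, cache: PlanCache,
             plan_from_scratch: Callable[[str], list[str]]) -> tuple[list[str], bool]:
    """Return (plan, reused). Falls back to fresh planning on a cache miss."""
    cached = cache.lookup(task)
    if cached is not None:
        return cached, True
    plan = plan_from_scratch(task)
    # Simplification: a real system would store the plan only after it succeeded.
    cache.store(task, plan)
    return plan, False
```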

4.6 Reasoning model ≠ agent quality automatically

Better reasoning doesn't translate cleanly to better agent behavior in every case. Reasoning models can be slower to start, more likely to over-think simple decisions, verbose in ways that consume context, and sometimes worse at tool-call format compliance. The production pattern of model routing inside the loop addresses this.
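Routing inside the loop can be as simple as the sketch below. The model names, step kinds, and the retry-based escalation rule are assumptions made for illustration, not a documented vendor pattern.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    kind: str          # e.g. "tool_selection", "planning", "summarization"
    retries: int = 0   # how many times this step has already failed


# Hypothetical model tiers; substitute whatever models you actually deploy.
FAST_MODEL = "fast-model"
REASONING_MODEL = "reasoning-model"
HARD_KINDS = {"planning", "replanning", "error_diagnosis"}


def pick_model(step: Step) -> str:
    """Send hard or repeatedly failing steps to the reasoning tier."""
    if step.kind in HARD_KINDS or step.retries >= 2:
        return REASONING_MODEL
    return FAST_MODEL
```

Keeping routine tool selection on the fast tier avoids the over-thinking and latency costs listed above while still escalating when a step proves difficult.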

5. Multi-agent systems

5.1 Why split work across agents

The case for multi-agent:

Specialization beats generalization at the prompt level. A focused "billing agent" prompt is more reliable than a "do everything" prompt.
Different agents need different tools. Splitting agents lets each one have a tightly curated tool set, avoiding the routing problem.
Different agents need different policies. Customer-facing guardrails differ from internal compliance guardrails.
Parallelism. Running multiple agents on independent subtasks at the same time is faster than running them sequentially.
Specialist models. Vision tasks on vision-tuned models, reasoning on reasoning models, summarization on cheap fast models.

The trade-offs are real: more prompts to maintain, more places to break, more inter-agent communication overhead, harder to debug, harder to evaluate end-to-end.

5.2-5.4 Multi-agent patterns

Multi-agent orchestration patterns covered in 5.2 through 5.4: agents-as-tools, handoffs, and supervisor-and-graph.

5.2 Orchestrator / worker (agents-as-tools)

One orchestrator sits at the top and farms out subtasks to workers. The orchestrator owns the user-facing conversation. The workers don't see the user directly; they receive a focused subtask, do it, return a result.

In OpenAI's Agents SDK this is the agents-as-tools pattern: worker agents are exposed to the orchestrator as tool calls (Agent.as_tool(...)). The orchestrator decides when to invoke a specialist and incorporates the specialist's response into its own reasoning. The user sees one coherent conversation.

OpenAI's published example: a Portfolio Manager orchestrator calls specialist agents (Macro, Fundamental, Quantitative) as tools, runs them in parallel, and synthesizes the combined output. Most enterprise multi-agent deployments end up here.
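The fan-out-and-synthesize shape can be sketched with plain asyncio. This is not the Agents SDK: the worker functions are hypothetical stand-ins for LLM-backed specialists, borrowing the role names from the example above, and the synthesis step is reduced to string joining.

```python
import asyncio


async def macro_agent(task: str) -> str:
    return f"macro view on {task}"          # stand-in for an LLM-backed worker


async def fundamental_agent(task: str) -> str:
    return f"fundamentals of {task}"


async def quantitative_agent(task: str) -> str:
    return f"quant signals for {task}"


WORKERS = {
    "macro": macro_agent,
    "fundamental": fundamental_agent,
    "quantitative": quantitative_agent,
}


async def orchestrate(task: str) -> str:
    """Fan the task out to every worker concurrently, then combine results."""
    names = list(WORKERS)
    results = await asyncio.gather(*(WORKERS[name](task) for name in names))
    # A real orchestrator would feed these results back into its own model.
    return "\n".join(f"{name}: {result}" for name, result in zip(names, results))


report = asyncio.run(orchestrate("ACME Corp"))
```

The workers never see the user; only the orchestrator's combined output does.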

5.3 Handoffs

Handoffs are the alternative to agents-as-tools. Instead of the orchestrator delegating and incorporating, control transfers from one agent to another. The receiving agent owns the next response to the user.

OpenAI's Agents SDK formalized this pattern. A handoff is implemented as a tool call: when an agent named Triage decides to hand off to Refund Agent, it calls a tool named transfer_to_refund_agent. The SDK intercepts the call and transfers control of the conversation to the target agent.


OpenAI's guidance: use handoffs when routing itself is part of the workflow and a specialist should own the conversation going forward. Use agents-as-tools when a specialist should help with a bounded subtask but shouldn't take over the user-facing conversation.
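The handoff-as-tool-call mechanic can be illustrated with a toy harness. This is a sketch of the idea, not the SDK's implementation: the `Agent` class and `run_turn` here are hypothetical, and only the transfer_to_<agent> naming follows the description above.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    name: str
    handoffs: list["Agent"] = field(default_factory=list)

    def handoff_tools(self) -> dict[str, "Agent"]:
        # One synthetic tool per allowed target, named transfer_to_<agent>.
        return {"transfer_to_" + a.name.lower().replace(" ", "_"): a
                for a in self.handoffs}


def run_turn(active: Agent, tool_call: str) -> Agent:
    """Return the agent that owns the next user-facing response."""
    target = active.handoff_tools().get(tool_call)
    # A handoff tool call transfers control; any other tool leaves it unchanged.
    return target if target is not None else active


refund = Agent("Refund Agent")
triage = Agent("Triage", handoffs=[refund])
owner = run_turn(triage, "transfer_to_refund_agent")
```

Contrast this with agents-as-tools, where the worker's result returns to the caller and the active agent never changes.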

5.4 Supervisor and graph

LangGraph's framing is supervisor: a designated agent that decides which other agent runs next, with the workflow expressed as a directed graph. Supports conditional edges, cycles, and checkpointing. The graph framing trades flexibility for predictability.

CrewAI's role-based crew is yet another flavor: agents defined by role/goal/backstory, tasks assigned to roles, a Process (sequential, hierarchical) determines the order. More declarative than LangGraph's explicit graph.

AutoGen uses GroupChat: agents take turns speaking, with a chat manager picking who speaks next. Closer to a meeting metaphor than to a graph or handoff.

5.5 Debate, critique, and the reviewer pattern

One agent produces a draft, another agent reviews it. This is the drafter/reviewer pattern, sometimes called debate (when adversarial) or critique (when corrective).

The M365 Copilot Researcher agent: one model produces the research output, a second model reviews it for completeness and accuracy, the system iterates if the review flags issues.

Useful when the task has clear evaluation criteria, a single agent tends to overlook its own mistakes, and the cost of an additional review pass is justified by the quality improvement.
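The draft, review, iterate loop can be sketched as below. The two callables stand in for the drafter and reviewer models; their signatures, and the convention that the reviewer returns None to approve, are assumptions for illustration.

```python
from typing import Callable, Optional


def draft_and_review(task: str,
                     draft: Callable[[str, Optional[str]], str],
                     review: Callable[[str], Optional[str]],
                     max_rounds: int = 3) -> tuple[str, int]:
    """Iterate draft -> review until the reviewer has no feedback.

    `review` returns None to approve, or a feedback string to request changes.
    """
    feedback: Optional[str] = None
    output = ""
    for round_number in range(1, max_rounds + 1):
        output = draft(task, feedback)
        feedback = review(output)
        if feedback is None:
            return output, round_number
    return output, max_rounds   # review budget exhausted; return the last draft


# Toy stand-ins for the two models.
def toy_draft(task: str, feedback: Optional[str]) -> str:
    return f"{task} report" + (" with sources" if feedback else "")


def toy_review(output: str) -> Optional[str]:
    return None if "sources" in output else "cite your sources"


final, rounds = draft_and_review("Q3", toy_draft, toy_review)
```

The `max_rounds` cap is what keeps the extra review cost bounded.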

5.6 Agent-as-tool vs full multi-agent

Agent-as-tool is technically a multi-agent pattern, but the simplest one. The orchestrator is the only one that talks to the user. Workers are essentially function calls that happen to use LLMs internally.

Full multi-agent is when agents communicate with each other, maintain their own state, and may run in parallel without a clear hierarchy. Examples include debate patterns, GroupChat, and LangGraph supervisors with cycles.

The distinction matters: agent-as-tool is much easier to build, debug, evaluate, and govern than full multi-agent. If your task can be solved with agent-as-tool, you should usually do that and skip the rest.

5.7 Where A2A fits

The Agent-to-Agent protocol (A2A, original spec from Google in 2025, governed by the Linux Foundation since June 2025) is for the case where the agents involved in a multi-agent system are owned by different parties. Within a single application built on the OpenAI Agents SDK, A2A is overkill; the SDK's handoffs handle inter-agent communication natively.

A2A becomes important when you have, for example, a Salesforce agent that needs to coordinate with a ServiceNow agent. The 2026 reality: A2A has real adoption among the major vendors but production cross-vendor A2A use is still rare. Most multi-agent today is single-vendor, single-application.

5.8 The "you probably don't need it yet" critique

Most agentic tasks are best solved by one well-designed agent. A single agent with a focused prompt, a curated tool set, and a reasoning model behind it can do much more than the multi-agent framework documentation implies. The default should be one agent.

The legitimate reasons to add structure: the prompt for one agent has gotten too long and unwieldy; the tool set has too many tools (50+) and the model is selecting badly; the policy requirements differ enough that a single guardrail set can't cover everything; the work has independent subtasks that can run in parallel for speed; the work has a clear drafter/reviewer structure where review materially improves quality.

The pragmatic position: start with one agent. When you hit a specific, articulable limitation that a known multi-agent pattern addresses, add structure. Resist the urge to build the elaborate multi-agent system before you've shipped the simple version.



                
6. Computer use and browser control

Computer use is the capability that lets an agent see a screen and click on it. It collapses essentially any GUI application into something an agent can interact with, including applications that don't have APIs, applications behind authentication walls, applications running on a desktop instead of in the cloud. It is also the most expensive, slowest, and most failure-prone tool in the agent toolbox.

6.1 What computer use actually is

The agent's environment includes a screenshot of the current screen, passed to the model as an image, and a set of action tools the agent can call: mouse_click(x, y), keyboard_type("text"), keyboard_press("ctrl+s"), scroll_down, screenshot. The actions are executed in the sandboxed environment; the screenshot is then taken again and the loop continues.
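The screenshot, act, screenshot cycle can be sketched against a fake environment. Everything here is hypothetical scaffolding: `FakeScreen` replaces a real sandboxed desktop, the screenshot is a text description rather than an image, and `scripted_policy` replaces the model call.

```python
from dataclasses import dataclass, field


@dataclass
class FakeScreen:
    """Stand-in for a sandboxed desktop; records what the agent did."""
    typed: str = ""
    log: list[str] = field(default_factory=list)

    def screenshot(self) -> str:
        return f"<image: text field contains {self.typed!r}>"

    def execute(self, action: dict) -> None:
        self.log.append(action["type"])
        if action["type"] == "keyboard_type":
            self.typed += action["text"]


def run_computer_use(env: FakeScreen, policy, max_steps: int = 10) -> int:
    """Screenshot -> policy picks an action -> execute -> repeat."""
    for step in range(max_steps):
        action = policy(env.screenshot())    # a model call in a real system
        if action["type"] == "done":
            return step
        env.execute(action)
    return max_steps                         # step budget exhausted


def scripted_policy(screenshot: str) -> dict:
    # Hypothetical policy: type "hello" once, then declare the task done.
    if "hello" in screenshot:
        return {"type": "done"}
    return {"type": "keyboard_type", "text": "hello"}


env = FakeScreen()
steps = run_computer_use(env, scripted_policy)
```

Note that the policy only learns its action worked by looking at the next screenshot, which is why every action costs a full model round trip.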

6.2 The current implementations

Anthropic Computer Use: first widely available implementation, shipping with Claude 3.5 Sonnet in October 2024 as research preview. Matured significantly through Claude Sonnet 4.5/4.6/4.7 and Claude Opus 4.6/4.7. Portable tool API that can run in any sandbox.
OpenAI Operator: launched January 2025 as research preview, ChatGPT Pro tier. Superseded July 17, 2025, when its capabilities were folded into ChatGPT as "agent mode."
ChatGPT Agent Mode: merger of Operator's browser capabilities with deep research capabilities. Available across Pro, Plus, Team, Enterprise tiers. BrowseComp 68.9%.
OpenAI Codex Background Computer Use (April 16, 2026): desktop counterpart, macOS-first, supports parallel concurrent sessions.
Project Mariner / Gemini Computer Use: Google's browser agent line, optimized for DOM-aware web actions.
Microsoft Copilot Studio Computer Use: in preview as of late 2025/early 2026.
Phi-4-reasoning-vision-15B (Microsoft Research, March 2026): small open-weight option tuned for GUI grounding.
                    

6.3 Browser-only vs full-OS

Browser-only is the most common production mode: the agent can interact with a single browser, and that's it. The reasoning: browsers have well-understood security models, attack surface is bounded, the user can see what's happening.

Full-OS lets the agent operate the entire operating system: file manager, system settings, native applications, terminals, IDEs. Much more powerful and much more dangerous. Production deployments that allow full-OS agent access typically run in disposable sandboxes (VMs, containers) rather than directly on the user's primary machine.

6.4 GUI grounding: the hard part

The technical challenge underneath computer use is GUI grounding: given a screenshot and a high-level intent, output the right pixel coordinates to click. Two architectural approaches:

Pure vision (OpenAI's CUA, Anthropic's Computer Use, Phi-4-reasoning-vision-15B): the model sees pixels, outputs coordinates. No structured input about what's on the page. Works on anything visible on screen.

Accessibility tree / DOM-aware (Project Mariner / Gemini Computer Use): the model gets both pixels AND the page's structured representation. Makes element selection more reliable because the model can target by semantic role rather than pixel position.

Production agents often combine both: use the structured representation when it's available and trustworthy, fall back to pixels when it isn't.
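The combined strategy can be sketched as a lookup with a fallback. The node format (dicts with role, name, and bounds keys) and the `vision_locate` callable are assumptions for illustration; real accessibility trees and vision models expose richer interfaces.

```python
from typing import Callable, Optional

Point = tuple[int, int]


def center(box: tuple[int, int, int, int]) -> Point:
    """Midpoint of a (left, top, right, bottom) bounding box."""
    left, top, right, bottom = box
    return ((left + right) // 2, (top + bottom) // 2)


def locate(intent_role: str, intent_name: str,
           ax_tree: Optional[list[dict]],
           vision_locate: Callable[[str], Point]) -> tuple[Point, str]:
    """Prefer a structured match; fall back to pixel-level grounding."""
    for node in ax_tree or []:
        if node.get("role") == intent_role and node.get("name") == intent_name:
            return center(node["bounds"]), "structured"
    # No usable tree, or no matching node: ask the vision model for coordinates.
    return vision_locate(f"{intent_role} labelled {intent_name}"), "vision"
```

Returning which path was taken makes it easy to log how often the agent is falling back to pixels.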

6.5 Latency, reliability, and cost

The real constraints on production computer use:

Latency. Each click is a tool call. A simple form fill that takes a human 30 seconds can take an agent 3-10 minutes.
Reliability. OSWorld benchmark results from early 2026 show humans at ~72.4%, leading models well below. Even the best computer use agents fail roughly a third of the time on tasks humans solve easily.
Cost. Vision-heavy prompts are expensive. A 50-step browser task can run several hundred screenshots through the model. A single complex computer-use task can cost $1-5.
Sandbox surface. Anthropic explicitly warns against running on the user's primary device. Recommended pattern is a VM or container, disposable.

6.6 The "fall back to computer use" rule

The pragmatic position: prefer API tool calls when an API exists; fall back to computer use only when there's no other option. API calls are deterministic, fast, and cheap. Computer use is none of those.

The exceptions where computer use earns its place: the application has no API; the API is gated, expensive, or rate-limited in ways the screen interface isn't; the task involves visual judgment; the task is an end-user task on the user's behalf.

For enterprise agent deployments, the rough hierarchy in 2026: MCP server > direct API call > computer use. Computer use is the escape hatch when neither of the first two is available.
                

                
7. The vendor agent product landscape

What each major vendor sells as their commercial agent product. The lens is "what can an enterprise buy and deploy today?" not "what frameworks are available for building your own?" (Section 12 covers frameworks.)

7.1 Anthropic

Claude Cowork is Anthropic's flagship general-purpose agent product. Research preview January 30, 2026, expanded to enterprise February 2026. Desktop app (Mac and Windows, plus web interface) that gives Claude access to a sandboxed shell and user-selected folders. Can read, write, edit files; execute code; chain multi-step tasks; connect to external services through MCP connectors.

Claude Code is the terminal-based agent for developers. Production-grade, widely adopted, runs on Anthropic's harness with Claude Opus and Sonnet behind it. Roughly 4% of public GitHub commits as of early 2026 are authored by Claude Code.

Claude in Chrome is the browser-native agent, available as an extension.

Claude Managed Agents entered public beta on April 8, 2026. Hosted Claude Platform service for long-horizon agent work. Provides stable interfaces for sessions, harnesses, and sandboxes.

May 2026 additions on the Anthropic side:

Claude Code Agent View: new CLI view that lets developers start agents, send them to the background, peek at status and last responses, and jump back in only when input is needed. Multi-session management built into Claude Code itself.
Claude for Small Business (May 13, 2026): toggle inside Claude Cowork. Ships with 15 ready-to-run agentic workflows across finance, operations, sales, marketing, HR, and customer service. Anthropic's bid to extend Cowork beyond enterprise into the SMB segment.
Vertical agent templates: 10 ready-to-run agent templates for financial services (pitchbooks, KYC, month-end close) and 12 practice-area plugins plus 20+ MCP connectors for legal work (research, contracts, discovery, matter management, legal aid). Legal professionals are now the most engaged Claude Cowork users of any knowledge-work function.
Subscription billing restructure announced May 14, 2026, effective June 15: separate billing pool for Agent SDK usage, a meaningful shift in how production agent traffic is priced.
                    

7.2 OpenAI

ChatGPT Agent Mode is the merged successor to OpenAI Operator and Deep Research. Consumer-facing browser-and-research agent, available across Pro, Plus, Team, Enterprise tiers.

Codex is the developer-facing agent line. Codex CLI runs in the terminal. Codex Background Computer Use (April 16, 2026) is the desktop variant. Codex IDE integration in VS Code and others.

Apps in ChatGPT (Apps SDK, October 2025) lets third parties build agents and integrations that surface inside ChatGPT.

Agents SDK (released March 2025) is the developer framework: handoffs, agents-as-tools, guardrails, tracing.

Codex now has 3M+ weekly developers as of the April 16, 2026 "Codex for (almost) everything" release. In May 2026 Codex landed inside the ChatGPT mobile app (iOS and Android, preview) so developers can monitor and manage running coding agents from their phones; macOS Background Computer Use remains the desktop-class option, gated outside the EEA, UK, and Switzerland at launch.

7.3 Microsoft

Copilot Cowork (GA May 1, 2026) is Microsoft's general-purpose desktop agent. Built on Anthropic's agentic harness, with a choice of underlying model. Bundled with the M365 E7 SKU at $99/user/month.

Agent 365 (May 1, 2026) is the agent control plane. Identity, observability, lifecycle, MCP server allowlists, Copilot Control System integration. $15/user/month standalone.

Copilot Studio is the no-code/low-code agent builder. Microsoft Foundry Agent Service is the developer platform. Azure Copilot agents: six specialist agents (migration, deployment, optimization, observability, resiliency, troubleshooting).

7.4 Google

Gemini Enterprise Agent Platform (announced at Google Cloud Next 2026) is the rebranded and significantly upgraded successor to Vertex AI. Four pillars: Build, Scale, Govern, Optimize.

Agent Development Kit (ADK) is the open-source framework underneath. Python, TypeScript, Go, Java. v1.0 May 2025. More than 6 trillion tokens per month are processed on Gemini models through ADK as of April 2026.

Agent Studio is the low-code visual agent builder. Agent Designer is the no-code natural-language interface. Gemini Enterprise app is the user-facing product where employees discover, share, and run agents.

7.5 AWS

Amazon Bedrock AgentCore is AWS's agent platform. GA October 2025. Sold as modular components:

AgentCore Runtime: secure serverless hosting, microVM per session (Firecracker isolation), 100MB payloads, supports MCP/A2A/AG-UI. Filesystem persistence (preview April 2026). Managed harness (preview April 22, 2026). Node.js support added April 28, 2026.
AgentCore Memory: working/episodic/semantic/procedural primitives, exposed via API.
AgentCore Gateway: wraps existing APIs/databases/Lambda functions as MCP-accessible tools.
AgentCore Code Interpreter: secure code execution.
AgentCore Browser: secure browser environment, 25 browser automation tools, OS-level interaction added April 8, 2026.
AgentCore Identity: agent identity and credential management. OBO token exchange GA April 30, 2026.
AgentCore Observability: OpenTelemetry-compatible.
AgentCore Evaluations (GA March 2026): 13 built-in evaluators. Recommendations + A/B testing preview April 30, 2026.
AgentCore Payments (preview May 7, 2026): managed payment infrastructure for autonomous agents via x402 protocol, Coinbase + Stripe.
S3 Files and EFS mounts (May 2026): AgentCore harnesses can attach Amazon S3 Files and Amazon EFS access points at CreateHarness or UpdateHarness time, up to five mounts per harness, mounted into every session at a configured path.
AWS GovCloud (US-West) availability (May 2026): AgentCore generally available in the GovCloud region for regulated and public-sector workloads.

Strands Agents is AWS's open-source agent SDK. Kiro is AWS's IDE for AI-assisted development. AWS Agent Registry (preview April 9, 2026) is the organization-private catalog.

7.6 Meta

Meta's agent strategy is more about platform infrastructure than packaged products. Llama Stack is the open-source agent framework around the Llama model family. Meta does not currently sell a first-party general-purpose agent product comparable to Cowork or ChatGPT Agent.

7.7 Enterprise SaaS vendors

The enterprise SaaS vendors have shipped their own agent products, mostly focused on their core domain:

Salesforce Agentforce 360: agent platform built into the Salesforce ecosystem.
ServiceNow AI Agents: IT service management, employee experience, customer service.
SAP Joule Agents: ERP-domain workflows (finance, HR, supply chain).
Workday Illuminate: workforce planning, payroll, HR analytics.

7.8 The picture in one frame

Five players (Anthropic, OpenAI, Microsoft, Google, AWS) own the platform layer. They differ in emphasis: Anthropic on harness quality, OpenAI on consumer breadth and developer SDKs, Microsoft on enterprise integration depth, Google on open standards and Workspace integration, AWS on infrastructure modularity. The enterprise SaaS vendors own their domains. Meta plays the open-weight + open-source framework role.
                

                
8. Evaluation: the hardest part

8.1 Why agent eval is qualitatively different from LLM eval

Evaluating a base LLM is well-understood: feed it a prompt, score the output against a reference. Evaluating an agent is fundamentally harder:

Trajectory matters, not just endpoint. An agent that solves a task in 3 tool calls is meaningfully different from one that solves the same task in 47.
Multi-step compounding errors. A small per-step error rate produces a much higher per-task error rate.
Stochasticity. Run the same agent on the same task three times, you get three different trajectories.
Cost as a dimension. Agent cost varies enormously per trajectory.
State and side effects. Agents that modify external state can't always be run repeatedly without cleanup.
                    

8.2 The standard benchmarks

A non-exhaustive map:

SWE-Bench / SWE-Bench Verified / SWE-Bench Pro: software engineering tasks from real GitHub issues. As of mid-May 2026, the Verified leaderboard top is Claude Mythos Preview (Anthropic limited release) at 93.9%, Claude Sonnet 5 at 92.4%, GPT-5.5 (released April 23, 2026) at 88.7%, Claude Opus 4.7 at 87.6%, GPT-5.3-Codex at 85.0%, with DeepSeek V4 Pro Max and Gemini 3.1 Pro tied at 80.6% as the open-weight and Google leaders respectively. SWE-Bench Pro (contamination-resistant): Claude Opus 4.7 at 64.3% (vs 87.6% on Verified). The gap between Verified and Pro is the contamination signal made concrete.
Terminal-Bench: terminal-based agentic tasks. April 2026: ForgeCode + Claude Opus 4.6 and ForgeCode + GPT-5.4 tied at 81.8%.
OSWorld: full computer-use benchmark. Humans ~72.4%; leading models well below.
GAIA: General AI Assistant, 466 tasks designed to be easy for humans (about 92% human success) and hard for AI.
TAU-Bench: tool-use benchmark from Sierra, customer service scenarios.
WebArena: web-browsing tasks across realistic web environments.
MLE-Bench: 75 Kaggle-style ML competitions.
Finance Agent v1.1: Anthropic-backed financial agent benchmark. Claude Opus 4.7 reports 64.4%.

No single benchmark gives the full picture. Production teams use a small portfolio and watch trends over time rather than chase any single number.

8.3 Judge-based and rubric eval

Many production tasks don't have a clean pass/fail signal. LLM-as-judge is the standard pattern: a separate, often more expensive model evaluates the agent's output against a rubric. The catches: judge-model bias (use cross-judge); rubric quality (vague rubrics produce noisy judgments); cost (adds 30-100% to eval cost).

Rubric-based eval without an LLM judge (regex, structured assertions, unit tests) is faster and cheaper and should be preferred when applicable. Most production eval pipelines blend both.
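A judge-free rubric can be as plain as the sketch below, which mixes a regex check with structured assertions. The scenario (a refund agent that replies in JSON), the field names, and the ORD-123456 id format are all invented for illustration.

```python
import json
import re

ALLOWED_STATUSES = {"approved", "denied", "escalated"}


def check_refund_reply(reply: str) -> dict[str, bool]:
    """Deterministic rubric for a hypothetical refund agent's JSON reply."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return {"valid_json": False}
    if not isinstance(data, dict):
        return {"valid_json": True, "is_object": False}
    amount = data.get("amount")
    return {
        "valid_json": True,
        "is_object": True,
        # Regex check: order ids look like ORD- followed by six digits.
        "order_id_format": bool(re.fullmatch(r"ORD-\d{6}", str(data.get("order_id", "")))),
        # Structured assertions on field types and values.
        "amount_non_negative": isinstance(amount, (int, float)) and amount >= 0,
        "status_allowed": data.get("status") in ALLOWED_STATUSES,
    }


def passed(results: dict[str, bool]) -> bool:
    return all(results.values())
```

Reporting each criterion separately, rather than a single pass/fail, shows which rule an agent version is regressing on.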

8.4 Production eval: the hard part

Benchmark performance does not predict production performance. The production eval stack typically has three layers:

Offline eval: curated test sets representing the real distribution of production traffic.
Online eval: sampling and scoring live production traffic. AWS AgentCore Evaluations (GA March 2026) explicitly supports online eval.
A/B testing: run two versions of the agent against fractions of production traffic, compare outcomes.

8.5 Trajectory-level evaluation

Trajectory evaluation scores not just the final answer but the path the agent took. An agent that solved the task in 3 tool calls is much better than one that took 47, even with the same final answer.

This is hard to operationalize. No consensus standard. Production teams that do it well typically write their own trajectory rubrics, score traces with LLM-as-judge, and watch the metric over time.
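Since there is no standard, the following is only one possible set of deterministic trajectory metrics, assuming a trace is available as (tool name, serialized arguments) pairs. The metric names and the step-budget efficiency formula are assumptions, and a rubric like this would normally sit alongside judge-based scoring rather than replace it.

```python
from collections import Counter


def trajectory_metrics(tool_calls: list[tuple[str, str]], succeeded: bool,
                       step_budget: int) -> dict[str, float]:
    """Score a trace given as (tool_name, serialized_args) pairs."""
    counts = Counter(tool_calls)
    # Exact duplicate calls often indicate the agent is looping.
    repeated = sum(n - 1 for n in counts.values())
    steps = len(tool_calls)
    # 1.0 when within budget, shrinking as the trace overshoots it.
    efficiency = min(1.0, step_budget / steps) if steps else 1.0
    return {
        "success": 1.0 if succeeded else 0.0,
        "steps": float(steps),
        "repeated_calls": float(repeated),
        "efficiency": efficiency if succeeded else 0.0,
    }


short = trajectory_metrics([("search", "a"), ("read", "b"), ("answer", "c")],
                           succeeded=True, step_budget=5)
long = trajectory_metrics([("search", "a")] * 47, succeeded=True, step_budget=5)
```

Both traces succeed, but the 47-call trace scores far lower on efficiency and is flagged for repeated calls.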

8.6 Production observability tools

Langfuse: open-source-core observability for LLM and agent applications.
Helicone: observability and gateway. Strong on cost analysis.
AgentOps: agent-specific observability with strong support for trajectory analysis.
Arize Phoenix: observability stack with eval, tracing, and dataset management.
Braintrust: eval-focused, rubric-based scoring and dataset management.
Datadog LLM Observability, New Relic AI Monitoring, Splunk for AI: established APM vendors extended to LLM and agent observability.

8.7 Cost evaluation

Production agent cost is a first-class evaluation metric:

Deep research agents: $5-50 per task.
Coding agents: $1-20 per task.
Customer service agents: $0.10-1 per conversation.
Computer-use tasks: $1-5 per complex task.

8.8 Open problems

Trajectory-level correctness: no standard.
Cost-quality tradeoffs: no clean comparison.
Long-horizon task eval: prohibitive cost.
Adversarial eval: more art than science.
Cross-vendor portability: not directly comparable.
Contamination: SWE-Bench Verified showed how popular benchmarks get absorbed into training data.
Real-world generalization: the persistent gap between benchmark and production.
                    
                

                
                    9. Production patterns

                    9.1 Sandboxing
                    For agents that execute untrusted code, sandboxing practice in 2026 has converged on microVM isolation. The dominant technology is Firecracker, AWS's open-source microVM manager. Each microVM runs its own Linux kernel, isolated from the host by hardware virtualization (KVM). A guest kernel exploit does not by itself reach the host, because the attacker must also break out of the hypervisor boundary.
                    Production options as of April 2026: AgentCore Runtime, E2B, Anthropic Managed Agents, Docker Desktop 4.58 (microVM-per-agent via native hypervisors), Northflank, Modal, Fly.io, Gitpod, Kata Containers, and gVisor (a user-space kernel sandbox rather than a microVM).
                    The practical guidance: if your agent runs code or browses the web, run it in a microVM, not a container. Boot-time cost has dropped enough that per-invocation isolation is now viable.

                    9.2 Long-running tasks and durable state
                    The pattern that's emerged: explicit checkpointing. The agent periodically writes progress state to durable storage. If the loop terminates for any reason, the next invocation reads the checkpoint and resumes.
                    AgentCore Runtime filesystem persistence (preview March 25, 2026) preserves agent filesystem state across stop/resume cycles up to 1GB per session, retained for 14 days of idle time. The runtime handles it; the agent doesn't write checkpoint logic.
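                    Where the runtime does not provide persistence, explicit checkpointing can be sketched as below. This is a minimal single-process illustration using a local JSON file; the state layout (`next_item`, `results`) and the `fail_at` parameter used to simulate a crash are assumptions for the example.

```python
import json
import os
import tempfile
from typing import Optional

def save_checkpoint(path: str, state: dict) -> None:
    """Write state atomically: write a temp file, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # a reader never sees a half-written checkpoint

def load_checkpoint(path: str) -> dict:
    """Return the last saved state, or a fresh one if no checkpoint exists."""
    if not os.path.exists(path):
        return {"next_item": 0, "results": []}
    with open(path) as f:
        return json.load(f)

def run(items: list, path: str, fail_at: Optional[int] = None) -> list:
    """Process items, checkpointing after each; resumes where it left off."""
    state = load_checkpoint(path)
    for i in range(state["next_item"], len(items)):
        if i == fail_at:
            raise RuntimeError("simulated crash")
        state["results"].append(items[i].upper())  # stand-in for real agent work
        state["next_item"] = i + 1
        save_checkpoint(path, state)
    return state["results"]
```

                    The design choice that matters is checkpointing after each completed unit of work, so a restart repeats at most one step.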

                    9.3 Async and dispatch patterns
                    The dispatch pattern: the user requests a task, the harness queues it, returns control immediately, and notifies the user when the task completes.
                    Examples: Claude Cowork Dispatch (announced 2026), OpenAI Codex Background Computer Use (April 16, 2026), and AgentCore async invocation patterns. The trade-off: async agents need observability that synchronous ones don't, because no one is watching the run as it happens.
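                    The dispatch pattern can be sketched in-process with asyncio. This is an illustration only: the `Dispatcher` class and its `notify` callback are hypothetical names, and a production system would use a durable queue rather than in-memory tasks.

```python
import asyncio
import uuid

class Dispatcher:
    """Accept tasks, return a task ID immediately, notify on completion."""

    def __init__(self, notify):
        self.notify = notify                      # callback(task_id, result)
        self.status: dict[str, str] = {}
        self._tasks: set[asyncio.Task] = set()

    def submit(self, coro_fn, *args) -> str:
        """Must be called from inside a running event loop."""
        task_id = uuid.uuid4().hex
        self.status[task_id] = "running"
        task = asyncio.create_task(self._run(task_id, coro_fn, *args))
        self._tasks.add(task)                     # hold a reference until done
        task.add_done_callback(self._tasks.discard)
        return task_id                            # control returns to the caller here

    async def _run(self, task_id, coro_fn, *args):
        try:
            result = await coro_fn(*args)
            self.status[task_id] = "done"
        except Exception as exc:
            result, self.status[task_id] = repr(exc), "failed"
        self.notify(task_id, result)

    async def drain(self):
        """Wait for all in-flight tasks (useful in tests and at shutdown)."""
        await asyncio.gather(*self._tasks)
```

                    The caller gets a task ID back before any agent work has run, and can poll `status` or wait for the notification.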

                    9.4 Human-in-the-loop
                    Even highly autonomous agents typically have human approval points:
                    
                        Confirmation gates at destructive operations.
                        Approval workflows for full task plans.
                        Escalation patterns when the agent gets stuck.
                        Two-tier tool classification: low-risk actions execute without approval; high-risk actions halt and wait.
                    
                    The honest reality: human-in-the-loop adds friction. Production teams that initially gate everything end up gating less over time as trust builds.
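                    The two-tier classification above can be enforced in the harness with a few lines. In this sketch the tool names in `HIGH_RISK` and the `approve` callback are hypothetical; in practice the callback might post to a review queue and block until someone decides.

```python
HIGH_RISK = {"delete_file", "send_email", "issue_refund"}  # hypothetical tool names

def execute_tool(name: str, args: dict, tools: dict, approve) -> dict:
    """Run low-risk tools directly; halt high-risk ones until a human approves.

    `approve(name, args)` is any callable returning True or False.
    """
    if name in HIGH_RISK and not approve(name, args):
        # Return the denial as an observation so the model can replan.
        return {"status": "denied", "tool": name}
    return {"status": "ok", "result": tools[name](**args)}
```

                    Loosening the gate over time, as trust builds, then amounts to removing names from the high-risk set.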

                    9.5 Observability
                    You cannot improve what you cannot see. Agents can run hundreds or thousands of tool calls per session; without good traces, debugging is guesswork. The core observability primitives: trace per session, OpenTelemetry compatibility, cost tracking per trace, quality metrics from production, anomaly detection.
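                    To make "trace per session" and "cost tracking per trace" concrete, here is a standard-library sketch. It is not the OpenTelemetry API; `SessionTrace` and its field names are assumptions, and a real deployment would export spans to an OpenTelemetry-compatible backend.

```python
import time
import uuid
from contextlib import contextmanager

class SessionTrace:
    """One trace per agent session; each model or tool call is a span."""

    def __init__(self):
        self.trace_id = uuid.uuid4().hex
        self.spans: list[dict] = []

    @contextmanager
    def span(self, name: str, **attrs):
        record = {"name": name, "attrs": attrs, "cost_usd": 0.0, "error": None}
        start = time.perf_counter()
        try:
            yield record                  # the caller may set record["cost_usd"]
        except Exception as exc:
            record["error"] = repr(exc)   # failed calls stay visible in the trace
            raise
        finally:
            record["duration_s"] = time.perf_counter() - start
            self.spans.append(record)

    def total_cost(self) -> float:
        return sum(s["cost_usd"] for s in self.spans)
```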

                    9.6 Cost management
                    Strategies that move the needle:
                    
                        Prompt caching is the highest-leverage optimization. Cuts input costs 70-90% on the cached portion.
                        Model routing within the agent loop. Cheap, fast model for routine selection; reasoning model for hard planning.
                        Tool result truncation and compression.
                        Batch API for 50% discounts on non-real-time work.
                        Budget caps: hard limits on token spend per task.
                        Caching beyond the prompt: result caching, retrieval caching, tool-result caching.
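                    Two of these strategies, budget caps and tool-result truncation, are simple enough to sketch directly. The class and function names and the 2,000-character default are illustrative assumptions; a real harness would count tokens with the model's tokenizer rather than characters.

```python
class BudgetExceeded(RuntimeError):
    pass

class TokenBudget:
    """Hard cap on tokens spent per task; the harness charges it every turn."""

    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.used = 0

    def charge(self, input_tokens: int, output_tokens: int) -> None:
        self.used += input_tokens + output_tokens
        if self.used > self.max_tokens:
            raise BudgetExceeded(f"{self.used} > {self.max_tokens} tokens")

def truncate_result(text: str, max_chars: int = 2000) -> str:
    """Keep the head and tail of a long tool result; drop the middle."""
    if len(text) <= max_chars:
        return text
    half = max(max_chars // 2, 1)
    omitted = len(text) - 2 * half
    return f"{text[:half]}\n[... {omitted} chars omitted ...]\n{text[-half:]}"
```

                    Keeping both ends of a long result is a judgment call: errors and summaries tend to sit at the end of command output, headers at the start.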
                    

                    9.7 Latency strategies
                    
                        Streaming: first-token latency matters more than total latency for user-facing agents.
                        Parallel tool calls: when independent, do them in parallel.
                        Prompt caching for boilerplate.
                        Speculative execution: kick off likely-needed tool calls before the model formally requests them.
                        Smaller models for fast paths.
                        Batching tool calls when the tool supports it.
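                    Parallel tool calls are the easiest of these to show. The sketch assumes each tool is an async function and each call is a dict with `id`, `name`, and `args` keys; that shape is an assumption for the example, not a specific SDK's format.

```python
import asyncio

async def run_tool_calls(calls: list[dict], tools: dict) -> list[dict]:
    """Run independent tool calls concurrently; results keep the input order."""

    async def one(call: dict) -> dict:
        try:
            output = await tools[call["name"]](**call["args"])
            return {"id": call["id"], "output": output}
        except Exception as exc:
            # One failing call should not cancel the others.
            return {"id": call["id"], "error": repr(exc)}

    return await asyncio.gather(*(one(c) for c in calls))
```

                    Total latency becomes roughly that of the slowest call rather than the sum of all calls.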
                    

                    9.8 The production stack you actually need
                    Minimum viable production agent infrastructure as of April 2026:
                    
                        Sandboxed execution environment (microVM-based, per-session isolation).
                        Tracing and observability (OpenTelemetry-compatible, cost tracking per trace).
                        Prompt caching turned on.
                        Budget caps and alerting on token spend, tool call count, wall-clock time per session.
                        Eval pipeline (offline test sets with regression detection, online eval sampling production traffic).
                        Human-in-the-loop gates at destructive operations.
                        Dispatch / async pattern for any task that takes more than a few seconds.
                    
                

                
                    10. Identity and authorization for agents

                    10.1 The problem
                    A traditional service account model (one identity per service, broadly scoped) doesn't fit agents well. An agent might run on behalf of multiple users in a single day. It might need different permissions for different tasks. The questions identity systems have to answer for agents:
                    
                        Who is this agent?
                        On whose behalf is it acting?
                        With what scope?
                        For how long?
                        Who is accountable?
                    

                    10.2 Microsoft Entra Agent ID
                    Microsoft Entra Agent ID is the identity framework for AI agents in the Microsoft ecosystem. It is in preview as of April 2026, as part of Microsoft Agent 365.
                    Three core constructs: Agent identity blueprint (reusable template), Agent identity (specific instance with Federated Identity Credentials), Agent's user account (optional, for "digital worker" scenarios).
                    Several design decisions worth noting: high-privilege Microsoft Graph permissions are blocked for agents; every agent identity has a sponsor (a human accountable for it); access packages support governance with expiration dates; OAuth 2.0, OpenID Connect, MCP, A2A all supported.

                    10.3 AWS AgentCore Identity
                    AgentCore Identity manages two flows. Inbound authentication: who can invoke the agent. AWS IAM (SigV4) for AWS-native, OAuth 2.0 for federated. Outbound authentication: agent accessing third-party services. Two modes: user-delegated or autonomous.
                    AgentCore Identity supports Microsoft Entra ID, Okta, Amazon Cognito, and standard OIDC providers.

                    10.4 OAuth on-behalf-of and RFC 8693 token exchange
                    The standards-based pattern underlying most agent delegation is OAuth 2.0 On-Behalf-Of with RFC 8693 Token Exchange. The agent exchanges the user's token for a new one whose subject (sub) is still the user and whose actor (act) claim names the agent. That preserves user-level audit (the action is attributed to the user) while making explicit that an agent did the work.
                    The catch: not every API supports the actor claim. Many APIs just check the subject and ignore the actor field.
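                    A sketch of what the exchange looks like on the wire. The parameter names and URN values come from RFC 8693; the token strings, audience, and scope are placeholders, and the helper names are hypothetical. Signature verification and the HTTP call itself are left out.

```python
from urllib.parse import urlencode

TOKEN_EXCHANGE_GRANT = "urn:ietf:params:oauth:grant-type:token-exchange"
ACCESS_TOKEN_TYPE = "urn:ietf:params:oauth:token-type:access_token"

def build_exchange_request(user_token: str, agent_token: str,
                           audience: str, scope: str) -> str:
    """Form-encoded body for an RFC 8693 token exchange with delegation."""
    return urlencode({
        "grant_type": TOKEN_EXCHANGE_GRANT,
        "subject_token": user_token,        # the user the agent acts for
        "subject_token_type": ACCESS_TOKEN_TYPE,
        "actor_token": agent_token,         # the agent doing the work
        "actor_token_type": ACCESS_TOKEN_TYPE,
        "audience": audience,
        "scope": scope,
    })

def describe_actor(claims: dict) -> str:
    """Read the subject and the optional actor from already-verified claims."""
    actor = claims.get("act", {}).get("sub")
    if actor is None:
        return f"{claims['sub']} acted directly"
    return f"{actor} acted on behalf of {claims['sub']}"
```

                    An API that logs `describe_actor(claims)` rather than only `claims["sub"]` is one that actually honors the actor claim.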

                    10.5 Composite identity
                    A concept emerging in 2026 enterprise governance is composite identity: the tuple of (human principal + agent instance + task context + scope + expiration) treated as a single addressable thing.
                    No single vendor has fully implemented composite identity at the protocol level yet. The shape is approximated by Entra Agent ID's sponsor + agent identity + delegated permissions + token expiration combination, AgentCore's inbound identity + outbound credential + session ID combination, and A2A's agent identity + task delegation envelope + scope claim approach.
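                    As a data structure, the tuple is straightforward even though no protocol carries it end to end yet. The field names below are assumptions that mirror the five elements named above; nothing here corresponds to a specific vendor's schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

@dataclass(frozen=True)
class CompositeIdentity:
    human_principal: str        # who the agent acts for
    agent_instance: str         # which agent is acting
    task_context: str           # the task or ticket this grant belongs to
    scopes: frozenset           # what it may do
    expires_at: datetime        # until when

    def permits(self, scope: str, now: Optional[datetime] = None) -> bool:
        """Allow an action only if it is in scope and the grant is unexpired."""
        now = now or datetime.now(timezone.utc)
        return scope in self.scopes and now < self.expires_at
```

                    Making the object immutable means a change of scope or task produces a new identity, which keeps audit records unambiguous.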

                    10.6 Practical implications
                    What enterprise teams deploying agents in 2026 are actually doing:
                    
                        Use a vendor-native identity primitive when possible.
                        Distinct identity per agent type.
                        Delegated tokens with short expiration.
                        Sponsor / accountability model.
                        Audit logging is non-negotiable.
                        Plan for token rotation.
                    
                

                
                    11. Governance and oversight

                    11.1 The producer/adopter split
                    Producers are the vendors who build configurable AI primitives (Microsoft, OpenAI, Anthropic, Google, AWS). Adopters are the enterprises who deploy those primitives in their specific business contexts. Compliance accountability lives primarily with the adopter, regardless of who built the agent.

                    11.2 Approval workflows and risk classification
                    Common classifications:
                    
                        Read-only / informational. Lowest-risk class.
                        Internal-write. Medium risk.
                        External / customer-facing. Higher risk.
                        Destructive or irreversible. Highest risk; per-action human confirmation typically required.
                    
                    Control patterns: pre-deployment review, runtime approval gates, post-action review, lifecycle management.

                    11.3 Vendor control planes
                    
                        Microsoft Agent 365: most fully realized example. Sits inside Microsoft Copilot Control System (CCS).
                        AWS Agent Registry: AWS's central catalog primitive.
                        AgentCore-native governance: identity, observability, and approval controls offered as a set of primitives.
                        Google Gemini Enterprise Agent Platform governance: single control plane spanning no-code, low-code, code-first agents.
                        Anthropic: agent governance ships through Claude for Enterprise, Claude Managed Agents controls, and broader Claude API admin tooling.
                    

                    11.4 Compliance frameworks
                    
                        EU AI Act: in force August 1, 2024. Full enforcement for high-risk AI in Annex III begins August 2, 2026. Penalties up to €35M or 7% of global annual turnover.
                        NIST AI Risk Management Framework: voluntary, US-centric. Four core functions: Govern, Map, Measure, Manage. NIST launched a dedicated initiative in February 2026 for autonomous AI agents.
                        ISO/IEC 42001: international standard for AI management systems. Third-party certification.
                        Singapore Model AI Governance Framework: first agentic-AI-specific governance framework, January 2026.
                        SOC 2, HIPAA, GLBA, FedRAMP, PCI DSS: industry-specific implications.
                    

                    11.5 Shadow agents
                    The shadow IT problem has an agent-era version: employees using consumer AI agents on work data without IT approval; engineers spinning up agent prototypes with permissions that exceed policy. Mitigations are familiar: sanctioned alternatives, network and identity-level visibility, policy as code, education.

                    11.6 Liability and accountability
                    The unsettled area. Current state of the practice:
                    
                        The adopter is generally accountable.
                        The producer can be liable for negligence (largely untested in court).
                        Insurance is forming around this; AI liability insurance products are emerging in 2026.
                        Indemnification clauses cover IP infringement claims; agentic actions less consistently covered.
                        Audit trail as defense.
                    

                    11.7 What good agent governance actually looks like
                    
                        Distinct agent identity with a named human sponsor.
                        Bounded scope (least-privilege permissions).
                        Approval gates at destructive operations.
                        Comprehensive audit logging.
                        Observability integrated with the company's monitoring stack.
                        Lifecycle management.
                        Controls mapped to compliance frameworks the organization is subject to.
                        Incident response plan for agent failures.
                        Documentation of what the agent does, what data it accesses, what controls are in place.
                    
                

                
                    12. Frameworks compared

                    12.1 The "you might not need a framework" position
                    For many agent tasks, no framework is needed. The minimal viable agent is a while loop, an LLM client, a tool registry, and some glue code. Anthropic's "Building Effective Agents" post argues this position explicitly.
                    When raw API + good prompts is enough: single agent with a small set of tools (under 10); linear or near-linear task flow; stateful behavior bounded to a single session; no multi-agent coordination needed.
                    When a framework starts earning its place: 20+ tools that need routing strategy; complex state machines with conditional flow and cycles; long-running tasks needing checkpointing; multi-agent orchestration; built-in observability that beats writing your own.
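                    The "tool registry and some glue code" part of the no-framework position can be sketched as a decorator. The schema layout follows the `name` / `description` / `input_schema` shape used in section 14.1; treating every parameter as a string is a deliberate simplification, and `read_file` is just an example tool.

```python
import inspect

TOOLS: dict = {}

def tool(fn):
    """Register a function as a tool, using its signature and docstring."""
    params = inspect.signature(fn).parameters
    TOOLS[fn.__name__] = {
        "fn": fn,
        "schema": {
            "name": fn.__name__,
            "description": inspect.getdoc(fn) or "",
            "input_schema": {
                "type": "object",
                # Simplification: every parameter is described as a string.
                "properties": {p: {"type": "string"} for p in params},
                "required": [p for p, v in params.items()
                             if v.default is inspect.Parameter.empty],
            },
        },
    }
    return fn

@tool
def read_file(path: str, encoding: str = "utf-8") -> str:
    """Read a text file and return its contents."""
    with open(path, encoding=encoding) as f:
        return f.read()

def dispatch(name: str, args: dict):
    """Look up a tool by the name the model emitted and call it."""
    if name not in TOOLS:
        return f"Unknown tool: {name}"
    return TOOLS[name]["fn"](**args)
```

                    Passing `[t["schema"] for t in TOOLS.values()]` as the tool list and routing tool calls through `dispatch` is most of the glue a small single-agent setup needs.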

                    12.2 LangGraph
                    The most production-battle-tested open-source agent framework as of May 2026. Released by LangChain, reached 1.0 GA in October 2025; now at v1.2.0 with deep agent templates, distributed runtime support in the CLI, checkpoints, streaming, human-in-the-loop, Studio, and Postgres integration. About 25K GitHub stars and 34.5M monthly downloads. Real production users: Uber, Klarna (85M users), LinkedIn, JPMorgan. Core abstraction: agents as directed graphs.
                    Good at: complex stateful workflows, built-in checkpointing ("time travel" debugging), streaming per-node tokens, LangSmith observability integration, human-in-the-loop patterns first-class.
                    Hard at: steepest learning curve of the major frameworks, verbose for simple tasks, less polished MCP/A2A support than newer frameworks.
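                    To give a feel for "agents as directed graphs" without depending on the library, here is a plain-Python sketch of the idea. It is not LangGraph's API: the `Graph` class, `add_node`, and the `END` sentinel are names invented for this illustration. Nodes transform a state dict, and a per-node router acts as a conditional edge, which is what allows cycles.

```python
from typing import Callable

END = "__end__"

class Graph:
    """Nodes transform a state dict; a router per node picks the next node."""

    def __init__(self):
        self.nodes: dict[str, Callable[[dict], dict]] = {}
        self.routers: dict[str, Callable[[dict], str]] = {}

    def add_node(self, name: str, fn, router) -> None:
        self.nodes[name] = fn
        self.routers[name] = router

    def run(self, start: str, state: dict, max_steps: int = 50) -> dict:
        current, steps = start, 0
        while current != END:
            if steps >= max_steps:
                raise RuntimeError("max_steps exceeded; possible runaway cycle")
            state = self.nodes[current](state)
            current = self.routers[current](state)   # conditional edge
            steps += 1
        return state

# A draft/review cycle: review sends work back until the second attempt.
g = Graph()
g.add_node("draft", lambda s: {**s, "attempts": s["attempts"] + 1},
           lambda s: "review")
g.add_node("review", lambda s: {**s, "approved": s["attempts"] >= 2},
           lambda s: END if s["approved"] else "draft")
final_state = g.run("draft", {"attempts": 0})
```

                    Because the state is an explicit dict passed between nodes, saving it after every node is what makes checkpointing and replay natural in this style.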

                    12.3 CrewAI
                    Role-based crew abstraction inspired by real-world organizational structures. Define agents with roles, goals, backstories; assemble into a crew; assign tasks. About 44,600 GitHub stars, v1.14.4 as of April 2026 with native MCP and A2A support (A2A and streaming added in v1.10).
                    Good at: lowest barrier to entry, multi-agent collaboration as primary design assumption, native MCP and A2A, rapid prototyping.
                    Hard at: simplicity abstracts away orchestration details that matter at scale, less mature checkpointing.

                    12.4 OpenAI Agents SDK
                    Released March 2025, replaced experimental Swarm. Production-grade with deliberately simple primitives: Agents, Tools, Handoffs, Guardrails, Tracing. v0.10.2 (April 2026); works with 100+ non-OpenAI models.
                    Good at: cleanest handoff model of any framework, built-in tracing in OpenAI dashboard, guardrails as first-class primitive, lowest boilerplate.
                    Hard at: more opinionated than LangGraph, state management less mature, best with OpenAI models.

                    12.5 Anthropic Claude Agent SDK
                    v0.1.48 as of April 2026. Less of a multi-agent orchestration framework; focuses on making a single Claude agent extraordinarily capable through managed sessions, container environments, skills, and persistent state.
                    Good at: tightest integration with Claude's harness, Skills, computer use, extended thinking; MCP-native; strong safety story; lifecycle control built in.
                    Hard at: tied to Claude models; multi-agent at application level not framework primitive; smaller ecosystem.

                    12.6 AWS Strands Agents
                    AWS's open-source agent SDK. Framework-agnostic deployment to AgentCore Runtime but runs anywhere.
                    Good at: native MCP support, tight integration with AgentCore primitives, used internally at Amazon, handles streaming and parallel agents well.
                    Hard at: smaller community, less rich documentation, AWS-centric in practice.

                    12.7 Google Agent Development Kit (ADK)
                    v1.26.0. Python, TypeScript, Go, Java. Hierarchical agent tree as core abstraction.
                    Good at: native A2A protocol support (best for cross-vendor interop), multimodal first-class via Gemini, tight integration with Google Cloud.
                    Hard at: newer than LangGraph and CrewAI, best with Gemini models, GCP-centric tooling.

                    12.8 Microsoft Agent Framework
                    Microsoft merged AutoGen and Semantic Kernel into the Microsoft Agent Framework. Reached Release Candidate February 19, 2026; 1.0 GA shipped April 3, 2026. Production-ready, open-source, .NET and Python. Stable APIs, versioned releases, long-term support commitment. Includes full MCP integration for dynamic tool discovery, graph-based workflows, A2A support, streaming, checkpointing, human-in-the-loop.
                    AutoGen is now in maintenance mode; Semantic Kernel's enterprise patterns absorbed into the new framework.

                    12.9 The smaller frameworks worth knowing
                    
                        Letta (formerly MemGPT): memory-first agent framework. Letta Code leads Terminal-Bench among model-agnostic agents.
                        LlamaIndex Agents: RAG-first.
                        Vercel AI SDK: TypeScript-first, web-application-oriented.
                        Mastra: TypeScript with built-in memory, agents, workflows.
                        PydanticAI: Python with Pydantic models as schema language.
                        BeeAI Framework (IBM): enterprise integration with IBM watsonx.
                        OpenAgents: native MCP + A2A simultaneously.
                    

                    12.10 Decision matrix

                    
                        
                        Framework choice by situation. Most frameworks can do most of the same things; ecosystem fit dominates.
                    

                    12.11 The practical take
                    The framework decision matters less than the harness, the model, the eval, and the production stack you build around it. A team that picks the "wrong" framework but invests heavily in eval and observability will outperform a team that picks the "right" framework but skips those investments. Pick something reasonable, ship, iterate.
                    The framework you start with is rarely the framework you finish with. Most production systems get rewritten 1-2 times in their first year as the team learns what they actually needed. Optimize for "easy to migrate from" rather than "perfect choice forever."
                

                
                    13. The agentic computing paradigm shift

                    13.1 The "models that take actions" framing
                    The cleanest way to describe the shift: AI used to produce text and images; now it takes actions. The text-and-images era was the chatbot era. The actions era is the agent era. Anthropic, OpenAI, Microsoft, Google, and AWS have all converged on this framing in different words.
                    Charles Lamanna, Microsoft's CVP for Microsoft 365, has used "fire-and-forget" as the operational metaphor: the user kicks off a task, the agent executes it without continuous supervision, the user gets the result later. The shift from "type, wait, see result" to "dispatch, do something else, get notified" is what fire-and-forget captures.

                    13.2 The agentic OS framing and the Karpathy critique
                    A more speculative framing: the agentic operating system. The idea is that the operating system of the next computing era is whatever agent runs on top of the OS and orchestrates everything else.
                    Andrej Karpathy is one of the more thoughtful commentators. Through 2024 and into 2025 he coined "agentic slop" for early agent demos. He publicly changed his stance after extended use of Claude Code. His continuing pushback: agents in 2026 are still better at narrow tasks than at general orchestration. The "single agent for everything" story may be a story we keep telling and never quite live up to.
                    Whether the framing is right or not, the platform vendors are building as if it is. Microsoft Copilot, Apple's expanded AI agent capabilities, Google's Gemini Enterprise app, AWS AgentCore (infrastructure for an "agent mesh") all align with the agentic-OS thesis, even if the strict framing turns out to be wrong.

                    13.3 World models and the second axis of scaling
                    A different thread: world models. Yann LeCun's thesis: LLMs by themselves are insufficient for general intelligence; the architecture has to change. LeCun's bet is on JEPA (Joint Embedding Predictive Architecture). Meta's V-JEPA 2 (June 2025) is the strongest demonstration to date, a 1.2 billion parameter world model trained on 1 million hours of video.
                    Meta's SuperIntelligence Lab is reportedly developing Mango (image/video generation, world-model lineage) and Avocado (text and coding) for H1 2026. If world models prove out for agentic tasks beyond robotics, the architectural picture shifts.

                    13.4 Test-time compute as a new scaling axis
                    Pre-2024, the dominant scaling strategy was bigger pretraining. Reasoning models opened a second axis: pay more inference compute per query for higher-quality answers. This matters for agents because per-step decision quality compounds across the loop.
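                    The compounding is easy to see with a back-of-the-envelope model. The sketch assumes every step succeeds independently with the same probability, which is a simplification (real agents can recover from some errors), but it shows why small per-step gains matter.

```python
def task_success_rate(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step in the loop is right, assuming independence."""
    return per_step_accuracy ** steps

for p in (0.90, 0.95, 0.99):
    print(f"per-step {p:.2f} -> 20-step task {task_success_rate(p, 20):.2f}")
```

                    Under this model, a 20-step task succeeds about 12% of the time at 90% per-step accuracy, about 36% at 95%, and about 82% at 99%.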
                    DeepSeek R1 (January 2025) was the first frontier-quality reasoning model with open weights. The release shifted understanding of where the frontier sits. Open-weight reasoning models multiplied through 2025 and into 2026 (DeepSeek V3.2, Qwen3-Coder, MiniMax M2.5, GLM-5, Phi-4-reasoning family, Kimi K2 Thinking).

                    13.5 Agentic search
                    Traditional search retrieves documents matching a query. Agentic search reasons about what the user wants, breaks the query into sub-questions, dispatches multiple searches in parallel, evaluates results, and produces a synthesized answer with citations.
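                    The pipeline in that paragraph can be sketched as a single function with pluggable stages. `decompose`, `search`, and `synthesize` are hypothetical callables supplied by the caller (in a real system the first and last would be model calls); the evaluation step is reduced here to de-duplicating sources by URL.

```python
from concurrent.futures import ThreadPoolExecutor

def agentic_search(question, decompose, search, synthesize, max_workers=4):
    """Decompose, search in parallel, de-duplicate, synthesize with citations."""
    sub_questions = decompose(question)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        result_lists = list(pool.map(search, sub_questions))  # keeps input order
    sources, seen = [], set()
    for results in result_lists:
        for r in results:
            if r["url"] not in seen:
                seen.add(r["url"])
                sources.append(r)
    answer = synthesize(question, sources)
    citations = [f"[{i}] {s['url']}" for i, s in enumerate(sources, start=1)]
    return answer, citations
```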
                    This shows up across vendors: Perplexity built its entire product around it; OpenAI Deep Research; Anthropic Claude with web search; Google Gemini Enterprise's Deep Research; Microsoft Copilot Researcher.
                    The economic significance is real: traditional search is the foundation of Google's $300B+ ad business. Agentic search collapses the search-results-page step. This changes the fundamental ad economics.

                    13.6 Coding agents as the leading edge
                    Coding has been the canonical leading-edge use case in 2025-2026, for four reasons: verifiable correctness (code passes tests or it doesn't), a rich tool ecosystem, a bounded cost of failure, and ready adoption by engineers.
                    Result: Claude Code at ~4% of public GitHub commits, GitHub Copilot at 26 million users, Letta Code leading Terminal-Bench among model-agnostic agents. SWE-Bench Verified scores climbed from 4.9% (GPT-4o, mid-2024) to 87.6% (Claude Opus 4.7, April 2026) and 93.9% (Claude Mythos Preview, limited release).

                    13.7 What's still hard
                    
                        Long-horizon reliability.
                        Cross-domain generalization.
                        Cost at scale.
                        Identity propagation across systems.
                        Governance interop across vendors.
                        True autonomy without supervision.
                        Visual and physical reasoning.
                        Adversarial robustness (prompt injection and tool poisoning have no clean defenses).
                    

                    13.8 The 40% number
                    Gartner published a forecast in 2025 that "over 40% of agentic AI projects will be canceled by the end of 2027" due to escalating costs, unclear business value, and inadequate risk controls.
                    The practitioner consensus: the 40% number is plausible but doesn't mean what the headlines suggest. Many projects launched in 2024-2025 were over-scoped, under-evaluated, and built on the assumption that more autonomy is better. Read that way, a 40% cancellation rate is a normal early-cycle adjustment.

                    13.9 The honest big-picture take
                    Agentic computing in 2026 is real, important, and oversold, each in different parts of the field. The capability gains over 2024 are large. The productization gains over 2025 are larger. The governance and eval gaps are real. The frame-of-reference shift, from chatbots to agents, has happened.
                

                
                    14. Hands-on: building your first agent
                    This section walks through building an agent in six progressive stages. The task is held constant: answer a research question with cited sources.

                    14.1 Step 1: Raw Anthropic API with tool calling
                    Start with no framework, no MCP, no abstraction. Just the API, a loop, and one tool.
import anthropic
import json

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Tool schema the model sees; the harness maps the name to a Python function.
tools = [{
    "name": "web_search",
    "description": "Search the web. Returns a list of {title, url, snippet}.",
    "input_schema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"]
    }
}]

def web_search(query: str) -> list:
    # Stub: swap in a real search backend.
    return [{"title": "Example", "url": "https://example.com",
             "snippet": "Relevant text about " + query}]

def run_agent(question: str, max_turns: int = 10) -> str:
    messages = [{"role": "user", "content": question}]
    for turn in range(max_turns):
        response = client.messages.create(
            model="claude-opus-4-7",
            max_tokens=4096,
            tools=tools,
            messages=messages,
        )
        if response.stop_reason != "tool_use":
            # end_turn, max_tokens, etc.: nothing to execute, so return the text.
            return "".join(b.text for b in response.content if b.type == "text")
        messages.append({"role": "assistant", "content": response.content})
        tool_results = []
        for block in response.content:
            if block.type != "tool_use":
                continue
            # Every tool_use block needs a matching tool_result, even on failure.
            known = block.name == "web_search"
            if known:
                content = json.dumps(web_search(block.input["query"]))
            else:
                content = f"Unknown tool: {block.name}"
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": content,
                "is_error": not known,
            })
        messages.append({"role": "user", "content": tool_results})
    return "Hit max turns without conclusion."
                    This is ReAct in its most stripped-down form. Everything else is built on top of this loop.

                    14.2 Step 2: Replace the hand-rolled tool with MCP
                    Swap the hand-rolled web_search with an MCP server. The tool definition and execution move out of your code into a separate process speaking MCP. Adding a second tool becomes a config change.
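                    Under the hood, the harness and the MCP server exchange JSON-RPC 2.0 messages. The sketch below only builds the two requests that matter for this step, `tools/list` and `tools/call`; the `jsonrpc_request` helper is a hypothetical name, and the transport, the initialization handshake, and response handling are omitted. In practice an MCP client library handles all of this.

```python
import json
from itertools import count

_ids = count(1)

def jsonrpc_request(method: str, params=None) -> str:
    """Serialize one JSON-RPC 2.0 request, the envelope MCP messages use."""
    message = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        message["params"] = params
    return json.dumps(message)

# Discover what the server offers, then call one tool by name.
list_request = jsonrpc_request("tools/list")
call_request = jsonrpc_request("tools/call", {
    "name": "web_search",
    "arguments": {"query": "agent sandboxing"},
})
```

                    The agent loop from Step 1 stays the same; only the body of the tool-execution branch changes from a local function call to a `tools/call` round trip.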

                    14.3 Step 3: Add memory
                    Two layers: per-session scratchpad (the agent can write notes to itself) and a vector store for cross-session recall. Expose remember() and recall() as tools.
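                    A toy version of both layers, small enough to read in one pass. The bag-of-words "embedding" is a stand-in for a real embedding model and vector store, and the `Memory` class is a hypothetical name; only the `remember()` and `recall()` interface is taken from the text.

```python
import math
from collections import Counter

class Memory:
    def __init__(self):
        self.scratchpad: list[str] = []   # per-session notes, cleared each session
        self.store: list[tuple] = []      # (vector, text) pairs kept across sessions

    @staticmethod
    def _embed(text: str) -> Counter:
        # Stand-in for a real embedding model: word counts.
        return Counter(text.lower().split())

    @staticmethod
    def _cosine(a: Counter, b: Counter) -> float:
        dot = sum(a[t] * b[t] for t in a)
        norm = (math.sqrt(sum(v * v for v in a.values()))
                * math.sqrt(sum(v * v for v in b.values())))
        return dot / norm if norm else 0.0

    def note(self, text: str) -> None:
        """Scratchpad write: the agent leaves itself a note for this session."""
        self.scratchpad.append(text)

    def remember(self, text: str) -> str:
        self.store.append((self._embed(text), text))
        return "stored"

    def recall(self, query: str, k: int = 3) -> list:
        q = self._embed(query)
        ranked = sorted(self.store, key=lambda item: self._cosine(q, item[0]),
                        reverse=True)
        return [text for _, text in ranked[:k]]
```

                    Exposing `remember` and `recall` as tools means the model decides what is worth keeping, which is the point of this step.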

                    14.4 Step 4: Add planning
                    Before doing anything, the agent writes out a plan, then executes against it. Compare with pure reactive on the same task: planned wins when structure is identifiable upfront; reactive wins when the right path depends on observation.
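                    A plan-then-execute loop with replanning on failure can be sketched as follows. `make_plan` and `execute_step` are hypothetical callables supplied by the caller; in a real agent the first is a model call that returns a list of steps, and the second runs one step and reports success plus an observation.

```python
def plan_and_execute(goal, make_plan, execute_step, max_replans=2):
    """Write a plan first, run it step by step, replan if a step fails."""
    plan = make_plan(goal, completed=[])
    completed, replans = [], 0
    while plan:
        step = plan.pop(0)
        ok, observation = execute_step(step)
        if ok:
            completed.append((step, observation))
            continue
        if replans >= max_replans:
            raise RuntimeError(f"gave up at step: {step}")
        replans += 1
        # Feed progress back in so the new plan covers only the remaining work.
        plan = make_plan(goal, completed=completed)
    return completed
```

                    The replan branch is where the two styles meet: a planned agent still has to react when an observation invalidates the plan.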

                    14.5 Step 5: Move to a framework
                    Do the same task in CrewAI and LangGraph. CrewAI reads as roles and tasks; LangGraph reads as nodes and edges. Same task; different intuition.

                    14.6 Step 6: Deploy to a managed runtime
                    Deploy to Anthropic Managed Agents (public beta April 8, 2026) or AWS AgentCore Runtime. The runtime gives you Firecracker microVM isolation, MCP/A2A/AG-UI support, filesystem persistence, identity, memory primitives, and observability.

                    14.7 What's missing
                    Still missing after Step 6: an eval pipeline, cost and latency observability, identity (OAuth on-behalf-of), approval gates, and governance. The number of agent demos that look great and then never ship is large; almost all of them stalled at the gap between Step 6 and real production.
                

                
                    15. Where this is going / open questions
                    The format is honest open questions where smart people disagree, with the disclaimer that twelve months from now this section will be either obsolete or visibly wrong on at least half its predictions.

                    15.1 Will the framework space consolidate?
                    Two scenarios: consolidation (collapse to 2-3 winners) or sustained fragmentation. The honest answer is "probably partial consolidation by 2027." Some frameworks will absorb others (the AutoGen merger pattern). Some will remain because they fit specific niches well.

                    15.2 Is the agentic OS framing right?
                    The likely answer is a mix rather than a yes or no. General-purpose agents will exist for tasks that span many domains. Specialized agents will dominate domain-deep workflows. The agentic OS framing oversells the unification but isn't entirely wrong.

                    15.3 What's the eval situation in twelve months?
                    Likely state: incremental progress, no breakthrough. Better trajectory rubrics, more contamination-resistant benchmarks, but the gap between benchmark and production persists. The wildcard: a real breakthrough in automated trajectory eval.

                    15.4 How does identity propagation scale?
                    The hard case: an agent invoked by user A delegates to an agent at company B which delegates to an API at company C. Cross-vendor agent flows are going to get worse before they get better.

                    15.5 Will cost-at-scale work?
                    Unit economics will work for high-value tasks (legal, financial, executive support). They will struggle for low-margin tasks (basic customer service, simple form filling). The divide itself will persist.

                    15.6 What do 2027-2028 models look like?
                    Candidates: world models (Meta investing heavily); bigger context plus better attention; stronger procedural memory; better cost-per-token at frontier quality; novel architectures (state-space models, MoE at very large scale, hybrid).

                    15.7 The questions worth tracking
                    
                        Which framework do production teams pick when starting fresh in 2026 vs. 2027?
                        What's the per-task cost trend for representative agent workloads?
                        How are governance violations actually surfacing?
                        Which capabilities cross from "demo" to "production-reliable"?
                        What does the standards picture look like?
                    
                

                
                    16. Resources

                    16.1 Foundational papers
                    
                        ReAct: Yao et al., October 2022. arXiv:2210.03629.
                        Reflexion: Shinn et al., March 2023. arXiv:2303.11366.
                        Tree of Thoughts: Yao et al., May 2023. arXiv:2305.10601.
                        MemGPT: Packer et al., October 2023. arXiv:2310.08560.
                        Lost in the Middle: Liu et al., July 2023. arXiv:2307.03172.
                        Chain-of-Thought: Wei et al., January 2022. arXiv:2201.11903.
                    

                    16.2 Vendor documentation
                    Anthropic: Building Effective Agents, Claude API tool use, Prompt caching, Computer use, Claude Agent SDK.
                    OpenAI: Agents SDK Python docs, Function calling.
                    Microsoft: Agent Framework, Entra Agent ID, Copilot Studio.
                    Google: Agent Development Kit, Gemini Enterprise Agent Platform.
                    AWS: AgentCore overview, AgentCore developer guide, Strands Agents.

                    16.3 Frameworks (primary GitHub repos)
                    
                        LangGraph
                        CrewAI
                        OpenAI Agents SDK
                        Claude Agent SDK
                        AWS Strands Agents
                        Google ADK
                        Microsoft Agent Framework
                        Letta
                    

                    16.4 Standards and protocols
                    
                        Model Context Protocol (see the MCP dossier)
                        Agent2Agent (A2A)
                        AG-UI
                    

                    16.5 Eval and benchmarks
                    
                        SWE-Bench, SWE-Bench Pro
                        Terminal-Bench, OSWorld
                        GAIA, WebArena
                    

                    16.6 Observability tools
                    
                        Langfuse
                        Helicone
                        AgentOps
                        Arize Phoenix
                        Braintrust
                    

                    16.7 Governance and compliance
                    
                        EU AI Act
                        NIST AI Risk Management Framework
                        ISO/IEC 42001
                        OWASP Top 10 for LLM Applications
                    

                    16.8 Blogs, newsletters, people to follow
                    
                        Anthropic Engineering
                        Latent Space
                        Simon Willison
                        Andrej Karpathy
                        Chip Huyen
                        Hamel Husain
                        Lilian Weng
                        AI Snake Oil / Arvind Narayanan
                    

                    16.9 Where to start if you're new
                    
                        Read Anthropic's "Building Effective Agents." Short, opinionated, mostly survives contact with reality.
                        Build the Section 14 walkthrough. Six steps, one afternoon, real understanding.
                        Pick one production deployment to follow. Candidates: Klarna's LangGraph deployment, Anthropic's Claude Code, Microsoft Copilot's enterprise rollout.