MODULE 10

Production AI Agents

Module 3 introduced ReAct — reasoning plus single-tool actions. This module is about what agents look like in production: multi-step plans, dozens of tools, robust error recovery, agent-to-agent coordination, and the discipline of keeping a 50-step process from collapsing.

3 lessons·~120 minutes

LESSON 10.1

Tool ecosystems and the design of large tool sets

A toy agent has 3 tools. A production agent has 30. As the toolset grows, the agent's ability to pick the right tool degrades — not linearly, but with a sharp drop-off around 15-20 tools depending on the model. This lesson is about designing tool sets that scale.

The "tool drawer" problem

When the model has too many tools, three failure modes emerge:

Wrong tool picked. Two tools have overlapping descriptions; the model picks the one that fits 80% of cases instead of the one for the 20% case.
Tool ignored. The model goes to its parametric knowledge instead of using the tool that would have been authoritative.
Combinatorial paralysis. With 40 tools, the model spends tokens listing every option in its reasoning chain.

The fix isn't more clever prompts. It's better organization.

Hierarchical tool exposure

Don't give the agent all 30 tools at once. Give it a top-level "router" tool that categorizes the request, then load only the subset of tools relevant to that category:

// Top-level tools (always available)
- search_knowledge_base
- categorize_request   // returns one of: billing, technical, account, other

// Conditionally loaded based on category
billing_tools:
  - get_invoice, refund_payment, update_payment_method, get_subscription
technical_tools:
  - check_system_status, search_known_issues, escalate_to_engineering
account_tools:
  - reset_password, update_email, close_account

Two-phase agent: the first call decides the category; the second call receives the relevant tools and acts. Each call sees 5-8 tools at most, and both calls perform far better than a single call exposing all 30 tools.
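A minimal sketch of the tool-loading half of that two-phase flow, assuming the tool names listed above. The `categorize` and `act` callables are hypothetical stand-ins for the two model calls; nothing here is tied to a specific framework.

```python
# Tool names mirror the listing above; the registry layout is an assumption.
TOP_LEVEL_TOOLS = ["search_knowledge_base", "categorize_request"]

TOOLS_BY_CATEGORY = {
    "billing": ["get_invoice", "refund_payment", "update_payment_method", "get_subscription"],
    "technical": ["check_system_status", "search_known_issues", "escalate_to_engineering"],
    "account": ["reset_password", "update_email", "close_account"],
}


def tools_for_phase_two(category: str) -> list:
    """Return the tool names to expose on the second call.

    Unknown categories (including "other") get only the top-level tools,
    so a bad router answer never widens the toolset.
    """
    return TOP_LEVEL_TOOLS + TOOLS_BY_CATEGORY.get(category, [])


def run_two_phase(request: str, categorize, act):
    """categorize and act are caller-supplied stand-ins for the two model calls."""
    category = categorize(request, TOP_LEVEL_TOOLS)
    return act(request, tools_for_phase_two(category))
```

The fallback for unrecognized categories is a design choice: failing closed keeps the second call small even when the router misbehaves.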

Frameworks like LangGraph, AutoGen, and CrewAI formalize this; you can also build it yourself in ~100 lines of orchestration code.

Tool description discipline

Tool descriptions are prompts. Treat them with the same rigor.

Bad:

{
  "name": "send_email",
  "description": "Sends an email",
  "input": { "to": "string", "body": "string" }
}

Better:

{
  "name": "send_email",
  "description": "Send an email from the authenticated user's account. Use this for one-off transactional emails (purchase confirmations, password resets, status updates). Do NOT use for marketing emails or newsletters — those require explicit user consent and go through send_marketing_email instead. Always confirm the email content with the user before sending unless they've pre-approved the action.",
  "when_to_use": [
    "User asks to send a specific email",
    "An automated workflow needs to notify someone"
  ],
  "when_not_to_use": [
    "Sending to multiple recipients (use send_bulk_email)",
    "Marketing content (use send_marketing_email)",
    "Confidential information (use secure_send_email)"
  ],
  "input": {
    "to": "Single email address (string, validated)",
    "subject": "Subject line, plain text, max 100 chars",
    "body_markdown": "Markdown-formatted body; will be rendered to HTML",
    "category": "transactional | system | status_update"
  }
}

The verbose version costs tokens but eliminates entire classes of misuse. Worth it for any tool that takes consequential action.

Disambiguate by intent, not implementation

Don't name tools after how they work. Name them after what they accomplish.

Implementation-named (bad)	Intent-named (good)
postgres_query, redis_get	find_customer, get_session_data
call_stripe_api	refund_payment
send_via_sendgrid	send_email
shell_exec	(don't expose this directly to agents)

The model picks tools by matching their names and descriptions against the task. "Find the customer with email X" maps cleanly onto `find_customer`; mapping it onto `postgres_query` takes extra work.
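One way to picture the difference: an intent-named tool wraps a fixed, parameterized query, so the model supplies only the email and never the SQL. This is a sketch using Python's built-in `sqlite3` as a stand-in for the real database; the `customers` schema and sample row are made up for illustration.

```python
import sqlite3
from typing import Optional


def find_customer(conn: sqlite3.Connection, email: str) -> Optional[dict]:
    """Intent-named tool: look up one customer by email.

    The SQL text is fixed and parameterized; the caller controls only
    the value bound to the placeholder.
    """
    row = conn.execute(
        "SELECT id, email, name FROM customers WHERE email = ?",
        (email.strip().lower(),),
    ).fetchone()
    if row is None:
        return None
    return {"id": row[0], "email": row[1], "name": row[2]}


# Demo against an in-memory database (hypothetical schema and data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id TEXT, email TEXT, name TEXT)")
conn.execute("INSERT INTO customers VALUES ('cus_abc123', 'ada@example.com', 'Ada')")
print(find_customer(conn, "Ada@Example.com"))
```

Because the query shape is fixed, the tool's blast radius is one read on one table, which is much easier to reason about than a general query tool.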

Avoid generic escape hatches

The single most dangerous pattern in agent tooling: a generic `run_code` or `execute_query` tool. The model will use it for everything because it's the most flexible — and you've just given it RCE-equivalent access to your infrastructure.

If you need flexibility, sandbox aggressively (no network, no file system, no env vars) and rate-limit. Better: don't expose generic execution to agents at all. Build the specific tools you actually need.

Rule of thumb: if a tool's blast radius is larger than the agent's mandate, it shouldn't exist in your toolset. Constrain the tools, not the agent.

LESSON 10.2

Planning, replanning, and recovery

ReAct works for 3-5 step tasks. Production agents routinely need 20+ steps to complete a task — and any step can fail. This lesson is about how to keep a long-running agent from collapsing under its own context window.

Plan-then-execute

For tasks longer than ~5 steps, separate the planning phase from the execution phase:

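// Phase 1: Plan
plan = llm(planning_prompt, task, available_tools)  // returns a list of steps

// Phase 2: Execute
completed_steps = []
while plan is not empty:
    step = plan.pop_first()
    result = llm(execution_prompt, step, tools_for_this_step)
    if result.failed:
        // replan returns only the steps that still need to run,
        // so execution resumes at the right point
        plan = replan(task, completed_steps, step, result.error)
    else:
        completed_steps.append(step)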

The planning model can be smaller/cheaper since it just produces structured output. The execution model is where you spend on quality.
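The loop above can be sketched as runnable Python. This is an illustration under assumptions, not a reference implementation: `make_plan`, `execute_step`, and `replan` are hypothetical callables standing in for model calls, and the `max_replans` cap is an added safeguard that the pseudocode does not mention.

```python
from collections import deque
from typing import Callable


def plan_and_execute(
    task: str,
    make_plan: Callable,
    execute_step: Callable,
    replan: Callable,
    max_replans: int = 2,
) -> list:
    """Run a plan step by step, replanning when a step fails.

    execute_step(step) returns {"ok": bool, "output": ..., "error": ...}.
    replan(task, completed, failed_step, error) returns only the steps
    that still need to run.
    """
    queue = deque(make_plan(task))
    completed = []
    replans = 0
    while queue:
        step = queue.popleft()
        result = execute_step(step)
        if result["ok"]:
            completed.append((step, result["output"]))
            continue
        if replans >= max_replans:
            raise RuntimeError(f"giving up after {replans} replans: {result['error']}")
        replans += 1
        # Discard the stale remainder of the old plan and continue with the new one.
        queue = deque(replan(task, completed, step, result["error"]))
    return completed
```

Using a queue rather than a `for` loop matters here: reassigning the plan inside a `for` loop would not change what the loop iterates over.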

A reasonable planning prompt:

You are planning how to complete this task:
{task}

Available tools:
{tool_summaries}

Produce a JSON plan as a list of steps. Each step has:
  - id (1, 2, 3...)
  - description (one sentence)
  - depends_on (array of step ids that must complete first)
  - tools (array of tool names this step might use)
  - success_criteria (how to know this step is done)

Constraints:
- 3-10 steps total. If a task seems to need more, you may be over-decomposing.
- Steps should be parallelizable where possible (depends_on=[])
- Each step should be small enough that a focused executor can finish it without further planning.

Output JSON only.
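Model-produced plans are worth validating before execution. The sketch below checks the constraints stated in the prompt (3-10 steps, known tools, valid `depends_on` references) and derives an execution order. The function name and error messages are illustrative assumptions.

```python
import json

REQUIRED_FIELDS = {"id", "description", "depends_on", "tools", "success_criteria"}


def validate_plan(raw: str, known_tools: set) -> list:
    """Parse a JSON plan and return step ids in a dependency-respecting order.

    Raises ValueError if the plan breaks the constraints in the prompt.
    """
    steps = json.loads(raw)
    if not isinstance(steps, list) or not 3 <= len(steps) <= 10:
        raise ValueError("plan must be a list of 3-10 steps")
    for s in steps:
        if not isinstance(s, dict) or not REQUIRED_FIELDS <= set(s):
            raise ValueError("step is missing a required field")
    ids = [s["id"] for s in steps]
    if len(set(ids)) != len(ids):
        raise ValueError("duplicate step ids")
    for s in steps:
        if not set(s["depends_on"]) <= set(ids):
            raise ValueError(f"step {s['id']} depends on an unknown step")
        if not set(s["tools"]) <= known_tools:
            raise ValueError(f"step {s['id']} names an unknown tool")

    # Repeatedly take every step whose dependencies are already ordered.
    deps = {s["id"]: set(s["depends_on"]) for s in steps}
    order = []
    while deps:
        ready = sorted(i for i, d in deps.items() if d <= set(order))
        if not ready:
            raise ValueError("dependency cycle in plan")
        order.extend(ready)
        for i in ready:
            del deps[i]
    return order
```

Steps that become ready in the same pass have no dependency on each other, so they are candidates for parallel execution.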

The "scratchpad" pattern

Long-running agents accumulate context — every tool call response, every reasoning step. Past ~30 steps, the relevant signal is buried in noise. The fix: maintain a structured scratchpad.

{
  "task": "Find and refund the duplicate charge for customer X",
  "completed_steps": [
    { "id": 1, "summary": "Located customer record: id=cus_abc123" },
    { "id": 2, "summary": "Found duplicate charges: ch_1, ch_2 (both $99 on 2026-03-15)" }
  ],
  "current_findings": {
    "customer_id": "cus_abc123",
    "duplicate_charges": ["ch_1", "ch_2"],
    "original_charge": "ch_1",  // earlier timestamp
    "duplicate_to_refund": "ch_2"
  },
  "next_step": { "id": 3, "description": "Issue refund for ch_2" }
}

The agent gets this scratchpad instead of (or alongside) the full conversation history. Massive token savings. Better reasoning, because the relevant state is structured.

The scratchpad updates between steps via either (a) the agent emitting a structured update or (b) a separate "summarizer" model call. Option (b) is more reliable.
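A scratchpad like the one above could be held in a small data structure that is updated after each step and serialized into the next prompt. The class and method names below are assumptions for illustration; the fields follow the JSON example.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class Scratchpad:
    task: str
    completed_steps: list = field(default_factory=list)
    current_findings: dict = field(default_factory=dict)
    next_step: Optional[dict] = None

    def record(self, step_id: int, summary: str, findings: dict,
               next_step: Optional[dict] = None) -> None:
        """Fold one finished step into the scratchpad."""
        self.completed_steps.append({"id": step_id, "summary": summary})
        self.current_findings.update(findings)
        self.next_step = next_step

    def render(self) -> str:
        """Serialize the state for inclusion in the next prompt."""
        return json.dumps(asdict(self), indent=2)


pad = Scratchpad(task="Find and refund the duplicate charge for customer X")
pad.record(
    1,
    "Located customer record: id=cus_abc123",
    {"customer_id": "cus_abc123"},
    next_step={"id": 2, "description": "Find duplicate charges"},
)
print(pad.render())
```

Whether `record` is fed by the agent's own structured update or by a separate summarizer call, the stored shape stays the same, which keeps the two options interchangeable.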

Error recovery taxonomy

When a tool call fails, the agent needs to know how to react. Different errors warrant different responses:

Error type	Right response
Transient (rate limit, timeout)	Retry with backoff (2-3 tries)
Invalid input	Reformulate the call, don't retry verbatim
Permission denied	Surface to user; don't escalate without authorization
Resource not found	Branch the plan — was the resource the goal, or a prerequisite?
Tool returned wrong data	Try alternative tool or replan
Unknown error	Halt and surface to a human

Encode this in the error responses you give the agent. Don't just return raw error strings — return error type + actionable hint.

// Bad
return { "error": "API returned 429" }

// Good
return {
  "error_type": "transient",
  "message": "Rate limited; the API allows 60 req/min. Last reset in 23 seconds.",
  "suggested_action": "wait_and_retry",
  "wait_seconds": 23
}
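The taxonomy can be enforced in a thin wrapper around each tool. The sketch below maps Python exceptions to typed errors and retries only transient ones with exponential backoff. The exception-to-type mapping and the `suggested_action` values other than `wait_and_retry` are assumptions; real tools would inspect status codes or provider-specific error classes.

```python
import time


def to_agent_error(exc: Exception) -> dict:
    """Map an exception to a typed error the agent can act on (illustrative mapping)."""
    if isinstance(exc, TimeoutError):
        kind, action = "transient", "wait_and_retry"
    elif isinstance(exc, PermissionError):
        kind, action = "permission_denied", "surface_to_user"
    elif isinstance(exc, (LookupError, FileNotFoundError)):
        kind, action = "not_found", "branch_plan"
    elif isinstance(exc, ValueError):
        kind, action = "invalid_input", "reformulate_call"
    else:
        kind, action = "unknown", "halt_for_human"
    return {"error_type": kind, "message": str(exc), "suggested_action": action}


def call_tool(tool, *args, max_tries: int = 3, base_delay: float = 0.5,
              sleep=time.sleep) -> dict:
    """Call a tool; retry transient failures with exponential backoff."""
    if max_tries < 1:
        raise ValueError("max_tries must be at least 1")
    for attempt in range(1, max_tries + 1):
        try:
            return {"ok": True, "result": tool(*args)}
        except Exception as exc:
            error = to_agent_error(exc)
            # Only transient errors are retried; everything else goes back to the agent.
            if error["error_type"] != "transient" or attempt == max_tries:
                return {"ok": False, **error}
            sleep(base_delay * 2 ** (attempt - 1))
```

Retrying inside the wrapper keeps transient noise out of the agent's context; the agent only sees an error once the retries are exhausted or the error is one it has to reason about.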

Termination conditions

Long agents need multiple safeguards against runaway execution:

Max steps: hard cap (e.g., 25). Refusing to finish is better than infinite spending.
Token budget: cumulative token-cost limit per task (e.g., $1.00).
Wall-clock timeout: 5 minutes max for interactive, 30 minutes for background.
Progress check: if the last 3 steps didn't change the scratchpad meaningfully, halt and surface to a human.
Confidence threshold: if the agent's stated confidence in its plan drops below a threshold, halt.

All five trip independently. Whichever triggers first wins.
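These safeguards can live in one small guard object that the agent loop consults after every step. The sketch below uses the example limits from the list; the `min_confidence` default of 0.5 is an assumption (the text gives no number), and "no meaningful change" is simplified to exact equality of a serialized scratchpad.

```python
import time
from typing import Optional


class RunGuard:
    """Tracks limits for one agent run; check() returns a halt reason or None."""

    def __init__(self, max_steps=25, max_cost_usd=1.00, max_seconds=300.0,
                 stall_window=3, min_confidence=0.5, clock=time.monotonic):
        self.max_steps = max_steps
        self.max_cost_usd = max_cost_usd
        self.max_seconds = max_seconds
        self.stall_window = stall_window
        self.min_confidence = min_confidence
        self.clock = clock
        self.started = clock()
        self.steps = 0
        self.cost_usd = 0.0
        self.recent_states = []

    def record(self, step_cost_usd: float, scratchpad_state: str) -> None:
        """Call once per finished step with its cost and the serialized scratchpad."""
        self.steps += 1
        self.cost_usd += step_cost_usd
        # Keep the baseline state plus the states after the last stall_window steps.
        self.recent_states = (self.recent_states + [scratchpad_state])[-(self.stall_window + 1):]

    def check(self, confidence: Optional[float] = None) -> Optional[str]:
        if self.steps >= self.max_steps:
            return "max_steps"
        if self.cost_usd >= self.max_cost_usd:
            return "budget"
        if self.clock() - self.started >= self.max_seconds:
            return "timeout"
        if len(self.recent_states) > self.stall_window and len(set(self.recent_states)) == 1:
            return "no_progress"
        if confidence is not None and confidence < self.min_confidence:
            return "low_confidence"
        return None
```

A typical loop would call `record(...)` after each step and stop as soon as `check(...)` returns anything other than `None`, logging the reason for later review.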

Human checkpoints

For high-consequence actions (financial transactions, account changes, sending external communications), put humans in the loop:

Confirmation: agent prepares the action, presents to user, waits for "yes/no" before executing
Review-after: agent acts, logs everything, human reviews periodically
Approval queue: agent submits to a queue; a human approves in bulk

Match the friction to the consequence. Confirmation for every action is annoying; reserve it for high-consequence actions such as refunds over $1000, and let review-after cover routine, low-risk ones.
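The three checkpoint styles differ only in when the human sees the action relative to execution. A minimal sketch, assuming the caller decides which mode applies to which action; the mode strings and the `dispatch` function are hypothetical names.

```python
def dispatch(action: dict, mode: str, execute, confirm, audit_log: list, queue: list) -> str:
    """Route one prepared action through the chosen human checkpoint.

    execute(action) performs the action; confirm(action) asks the user and
    returns True or False. audit_log and queue are plain lists standing in
    for real storage.
    """
    if mode == "confirm":
        # Human decides before anything happens.
        if not confirm(action):
            return "rejected"
        execute(action)
        return "executed"
    if mode == "review_after":
        # Act now, leave a record for periodic human review.
        execute(action)
        audit_log.append(action)
        return "executed"
    if mode == "approval_queue":
        # Nothing runs until a human approves the queued item.
        queue.append(action)
        return "queued"
    raise ValueError(f"unknown checkpoint mode: {mode}")
```

Keeping the mode as an explicit argument makes the friction policy a piece of configuration that can be reviewed on its own, separate from the agent's prompt.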

Exercise: For an agent task you've built or are designing, list every step that could fail. For each, write what the error response should tell the agent. This is more useful than another prompt tweak.

LESSON 10.3

Multi-agent systems and orchestration

The next step up from a single agent: multiple agents that hand work to each other. This unlocks specialization (each agent has its own narrow prompt and toolset) and parallelism (agents work simultaneously). It also unlocks new failure modes. This lesson is about when to reach for multi-agent and how to keep it from spiraling.

When multi-agent is the right answer

A single agent with a long prompt and many tools can do almost anything. So when do you split into multiple agents?

Good reasons:

Specialization: each agent has a coherent role with focused prompts. A "researcher" agent + "writer" agent + "editor" agent is easier to maintain than one mega-prompt that does all three.
Parallelism: three independent subtasks can run simultaneously, cutting end-to-end latency.
Different models: the researcher needs fast, cheap retrieval; the writer needs the most capable model. Splitting lets you tier costs.
Audit: each agent's output is a checkpoint; easier to debug than a 50-step monolithic trace.

Bad reasons:

"Multi-agent sounds more impressive than a single agent": impressiveness is not a requirement, and single agents are often more reliable.
"Agents will negotiate among themselves" — they won't, in any meaningful way. They'll just generate plausible negotiation text.
"Each agent will have its own personality" — adds variance without benefit for most use cases

Common multi-agent patterns

1. Pipeline (sequential). Output of agent A feeds agent B feeds agent C. Each agent is specialized. Classic for content workflows: research → draft → edit → publish.

2. Manager-worker (orchestration). A "manager" agent decomposes the task and dispatches subtasks to "worker" agents. Workers don't know about each other; they just receive task → return result. The manager assembles.

3. Parallel + aggregator. Same task fanned out to multiple agents that process in different ways; results combined. Good for "draft 5 options" or "gather opinions from different perspectives."

4. Debate. Two agents argue opposing positions; a third evaluates. Useful for high-stakes decisions where you want explicit consideration of trade-offs.

Start with pipeline. It's by far the most common and most reliable.
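For comparison with the pipeline, the fan-out in pattern 3 (and the dispatch half of pattern 2) can be sketched with `asyncio.gather`. The `workers` mapping and `aggregate` callable are hypothetical stand-ins for agent calls and the combining step.

```python
import asyncio


async def fan_out(task: str, workers: dict, aggregate):
    """Run every worker on the same task concurrently, then combine the results.

    workers maps a name to an async callable standing in for an agent.
    A worker that raises is reported in the second return value instead of
    sinking the whole run.
    """
    names = list(workers)
    results = await asyncio.gather(
        *(workers[name](task) for name in names),
        return_exceptions=True,
    )
    ok = {n: r for n, r in zip(names, results) if not isinstance(r, BaseException)}
    failed = {n: repr(r) for n, r in zip(names, results) if isinstance(r, BaseException)}
    return aggregate(ok), failed
```

Workers here never see each other's output, which matches the manager-worker rule that workers only receive a task and return a result.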

The communication problem

Multi-agent setups fail in characteristic ways at the agent-to-agent communication layer:

Telephone game. Information degrades as it passes between agents. Each summarizes the previous output, losing nuance.
Loose coupling drift. Agent B receives output that doesn't quite match what it expected. Without strict schemas, errors compound.
Cost explosion. Each handoff is a new LLM call with full context. 5 agents × 10 steps = 50 calls minimum.

The fixes:

Structured handoffs. Each agent's output conforms to a schema the next agent expects. Validate at boundaries. No prose-only handoffs.
Shared scratchpad. Like single-agent scratchpad but accessible to all agents. The "ground truth" of the task state. Each agent reads + writes specific fields.
Smaller models for handoff agents. A "router" that decides which worker to invoke doesn't need a frontier model such as GPT-4; a small model in the Haiku or mini class works fine.
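Validating at the boundary can be as simple as parsing each handoff into a typed object before the next agent sees it. The schema below is hypothetical (a researcher handing off to a writer); the field names are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResearchHandoff:
    """Hypothetical schema for what a researcher agent passes to a writer agent."""
    topic: str
    key_facts: list
    sources: list

    @classmethod
    def parse(cls, payload: dict) -> "ResearchHandoff":
        """Build a handoff from raw agent output, rejecting anything off-schema."""
        missing = {"topic", "key_facts", "sources"} - set(payload)
        if missing:
            raise ValueError(f"handoff missing fields: {sorted(missing)}")
        if not isinstance(payload["topic"], str) or not payload["topic"].strip():
            raise ValueError("topic must be a non-empty string")
        for name in ("key_facts", "sources"):
            value = payload[name]
            if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
                raise ValueError(f"{name} must be a list of strings")
        if not payload["key_facts"]:
            raise ValueError("key_facts must not be empty")
        return cls(payload["topic"], list(payload["key_facts"]), list(payload["sources"]))
```

A failed parse is a natural place to return a typed error to the upstream agent and ask it to re-emit, rather than letting a malformed handoff propagate.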

Orchestration frameworks: tools, not solutions

LangGraph, CrewAI, AutoGen, and Anthropic's Claude Agent SDK all give you scaffolding for multi-agent systems. They handle the loops, the state, and the tool dispatching. They don't:

Make your agents reliable. That's still your job (prompts, tools, evals).
Make your system maintainable. Multi-agent systems are inherently harder to debug than single-agent ones.
Save you from designing the data flow. The orchestration framework only knows what you tell it.

Pick a framework based on your stack and the agent topology you need, not based on which has the best demos. Better yet: build single-agent systems until you genuinely hit the wall, then introduce multi-agent only where it pays off.

Evaluation: the hardest part

Multi-agent systems are stochastic at every hop. An eval that only tests "did the final output meet the spec" tells you that the system failed, but not where it broke. Better:

Trace replays. Save the full multi-agent trace for every eval run. Compare trajectories, not just outputs.
Per-agent evals. Each agent gets its own eval set with expected handoff formats. You can swap out implementations of one agent without breaking the system if the schema holds.
System-level evals. End-to-end behavior on the user-facing task. The number that actually matters.
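Comparing trajectories can start very simply: find the first hop where a run departs from a known-good trace. The trace format below (a list of dicts with `agent` and `action` keys) is an assumption for illustration; real traces would carry more fields.

```python
from typing import Optional


def first_divergence(baseline: list, candidate: list,
                     keys=("agent", "action")) -> Optional[int]:
    """Return the index of the first hop where two traces differ, or None.

    Only the fields named in keys are compared, so free-text outputs that
    vary from run to run do not count as a divergence.
    """
    for i, (a, b) in enumerate(zip(baseline, candidate)):
        if any(a.get(k) != b.get(k) for k in keys):
            return i
    if len(baseline) != len(candidate):
        # One trace is a strict prefix of the other.
        return min(len(baseline), len(candidate))
    return None


good = [
    {"agent": "researcher", "action": "search", "output": "..."},
    {"agent": "writer", "action": "draft", "output": "..."},
    {"agent": "editor", "action": "edit", "output": "..."},
]
bad = [
    {"agent": "researcher", "action": "search", "output": "different text"},
    {"agent": "writer", "action": "search", "output": "..."},
]
print(first_divergence(good, bad))
```

Pointing at a hop index turns "the final output was wrong" into "the writer called the wrong action at hop 1", which is the level at which per-agent evals can be written.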

When to ship vs. when to wait

Multi-agent is a rapidly evolving area. Frameworks are immature. Best practices are still forming. Costs are higher than they should be.

If you're shipping a product, lean toward single-agent + good tools. Multi-agent is the right call when:

You've genuinely hit the limit of single-agent quality and have evals proving it
The workflow has natural specialization (different skills needed at different stages)
The cost overhead is justified by the value of the output
You have the infrastructure to maintain a more complex system

For most production AI features in 2026, a well-designed single agent with 8-12 tools beats a 5-agent system. That balance will shift as orchestration tools mature, but right now: simpler is better.

Exercise: Sketch a 3-agent pipeline (research → draft → edit) for a workflow you care about. Then sketch the same workflow as a single agent with all three sub-roles in the system prompt. Run both on the same 10 inputs. Compare quality, cost, latency, and debuggability. The single-agent version often wins on at least three of those.

Module 10 wrap-up

You now have the production agent toolkit: tool design that scales, planning that survives long tasks, error recovery with intent, and multi-agent orchestration with eyes open about the costs. The patterns in this module are the highest-leverage skills in current AI engineering — they're what separates "a demo that works" from "a system that runs."

The course ends here. From here, you build. The Projects section gives you three capstone projects to apply everything from Modules 1-10. Pick one, ship it, and you've turned this course into something you can put on your resume.