MODULE 10

Production AI Agents

Module 3 introduced ReAct — reasoning plus single-tool actions. This module is about what agents look like in production: multi-step plans, dozens of tools, robust error recovery, agent-to-agent coordination, and the discipline of keeping a 50-step process from collapsing.

3 lessons · ~120 minutes

LESSON 10.1

Tool ecosystems and the design of large tool sets

A toy agent has 3 tools. A production agent has 30. As the toolset grows, the agent's ability to pick the right tool degrades — not linearly, but with a sharp drop-off around 15-20 tools depending on the model. This lesson is about designing tool sets that scale.

The "tool drawer" problem

When the model has too many tools, three failure modes emerge:

  1. Wrong tool picked. Two tools have overlapping descriptions; the model picks the one that fits 80% of cases instead of the one for the 20% case.
  2. Tool ignored. The model goes to its parametric knowledge instead of using the tool that would have been authoritative.
  3. Combinatorial paralysis. With 40 tools, the model spends tokens listing every option in its reasoning chain.

The fix isn't more clever prompts. It's better organization.

Hierarchical tool exposure

Don't give the agent all 30 tools at once. Give it a top-level "router" tool that decomposes the request, then load the relevant subset:

// Top-level tools (always available)
- search_knowledge_base
- categorize_request   // returns one of: billing, technical, account, other

// Conditionally loaded based on category
billing_tools:
  - get_invoice, refund_payment, update_payment_method, get_subscription
technical_tools:
  - check_system_status, search_known_issues, escalate_to_engineering
account_tools:
  - reset_password, update_email, close_account

Two-phase agent: the first call decides the category, and the second call receives only the relevant tools and acts. Each call sees at most 5-8 tools, and each selects tools far more reliably than a single call that exposes the whole toolset at once.

Frameworks like LangGraph, AutoGen, and CrewAI formalize this; you can also build it yourself in ~100 lines of orchestration code.
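The two-phase flow can be sketched in a few lines. This is a minimal illustration, not a framework API: the tool names come from the listing above, while `tools_for`, `handle_request`, and the two stub callables are hypothetical stand-ins for real model calls.

```python
from typing import Callable

# Tool names are taken from the example above; the grouping dict is illustrative.
TOOLS_BY_CATEGORY = {
    "billing": ["get_invoice", "refund_payment", "update_payment_method", "get_subscription"],
    "technical": ["check_system_status", "search_known_issues", "escalate_to_engineering"],
    "account": ["reset_password", "update_email", "close_account"],
    "other": [],
}


def tools_for(category: str) -> list:
    """Return the tool names exposed in phase two for a category."""
    # Unknown categories fall back to "other" so a bad label never widens the toolset.
    subset = TOOLS_BY_CATEGORY.get(category, TOOLS_BY_CATEGORY["other"])
    return ["search_knowledge_base"] + subset


def handle_request(
    request: str,
    categorize: Callable[[str], str],
    act: Callable[[str, list], str],
) -> str:
    """Phase 1 picks a category; phase 2 acts with only that category's tools."""
    category = categorize(request)            # hypothetical model call 1
    return act(request, tools_for(category))  # hypothetical model call 2


# Stub callables stand in for real model calls so the sketch runs offline.
def fake_categorize(request: str) -> str:
    return "billing" if "invoice" in request.lower() else "other"


def fake_act(request: str, tools: list) -> str:
    return f"acting on {request!r} with {len(tools)} tools"
```

Calling `handle_request("Where is my invoice?", fake_categorize, fake_act)` routes to the billing subset, so the second call sees five tools rather than the full set.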

Tool description discipline

Tool descriptions are prompts. Treat them with the same rigor.

Bad:

{
  "name": "send_email",
  "description": "Sends an email",
  "input": { "to": "string", "body": "string" }
}

Better:

{
  "name": "send_email",
  "description": "Send an email from the authenticated user's account. Use this for one-off transactional emails (purchase confirmations, password resets, status updates). Do NOT use for marketing emails or newsletters — those require explicit user consent and go through send_marketing_email instead. Always confirm the email content with the user before sending unless they've pre-approved the action.",
  "when_to_use": [
    "User asks to send a specific email",
    "An automated workflow needs to notify someone"
  ],
  "when_not_to_use": [
    "Sending to multiple recipients (use send_bulk_email)",
    "Marketing content (use send_marketing_email)",
    "Confidential information (use secure_send_email)"
  ],
  "input": {
    "to": "Single email address (string, validated)",
    "subject": "Subject line, plain text, max 100 chars",
    "body_markdown": "Markdown-formatted body; will be rendered to HTML",
    "category": "transactional | system | status_update"
  }
}

The verbose version costs tokens but eliminates entire classes of misuse. Worth it for any tool that takes consequential action.
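One way to enforce this discipline is a small lint that runs over tool specs before they ship. The sketch below assumes specs are plain dicts shaped like the two examples above; `lint_tool_spec` and the 15-word threshold are illustrative choices, not part of any tool-calling standard.

```python
def lint_tool_spec(spec: dict) -> list:
    """Return a list of human-readable problems found in a tool spec."""
    problems = []
    description = spec.get("description", "")
    # A one-line description rarely says when the tool should or should not be used.
    if len(description.split()) < 15:
        problems.append("description is shorter than 15 words")
    for field in ("when_to_use", "when_not_to_use"):
        if not spec.get(field):
            problems.append(f"missing or empty {field}")
    # Every input should carry a description, not just a bare type name.
    for name, doc in spec.get("input", {}).items():
        if doc.strip().lower() in {"string", "number", "boolean", "object"}:
            problems.append(f"input {name!r} is described only by its type")
    return problems
```

Run against the "Bad" spec above, this reports five problems; a spec in the style of the "Better" version returns an empty list.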

Disambiguate by intent, not implementation

Don't name tools after how they work. Name them after what they accomplish.

| Implementation-named (bad) | Intent-named (good) |
| --- | --- |
| postgres_query, redis_get | find_customer, get_session_data |
| call_stripe_api | refund_payment |
| send_via_sendgrid | send_email |
| shell_exec | (don't expose this directly to agents) |

The model picks a tool by matching the task against tool names and descriptions. "Find the customer with email X" matches `find_customer` cleanly; mapping the same request onto `postgres_query` takes an extra inference step that can go wrong.

Avoid generic escape hatches

The single most dangerous pattern in agent tooling: a generic `run_code` or `execute_query` tool. The model will reach for it constantly because it is the most flexible option, and you have effectively granted it remote code execution (RCE) on your infrastructure.

If you need flexibility, sandbox aggressively (no network, no file system, no env vars) and rate-limit. Better: don't expose generic execution to agents at all. Build the specific tools you actually need.
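As an illustration of a specific tool replacing a generic `execute_query`, here is a minimal sketch using Python's standard `sqlite3` module. The `customers` table, its columns, and the `demo_connection` helper are assumptions made up for the example.

```python
import sqlite3
from typing import Optional


def find_customer(conn: sqlite3.Connection, email: str) -> Optional[dict]:
    """Look up one customer by email; this is the only query the tool can run."""
    # The SQL text is fixed and the email is bound as a parameter, so the
    # model controls a single value, never the statement itself.
    row = conn.execute(
        "SELECT id, email, plan FROM customers WHERE email = ?",
        (email,),
    ).fetchone()
    if row is None:
        return None
    return {"id": row[0], "email": row[1], "plan": row[2]}


def demo_connection() -> sqlite3.Connection:
    """Build a throwaway in-memory database for the example (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id TEXT, email TEXT, plan TEXT)")
    conn.execute("INSERT INTO customers VALUES ('cus_abc123', 'a@example.com', 'pro')")
    return conn
```

Because the statement is fixed, the worst a confused or manipulated model can do with this tool is look up the wrong email address.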

Rule of thumb: if a tool's blast radius is larger than the agent's mandate, it shouldn't exist in your toolset. Constrain the tools, not the agent.

LESSON 10.2

Planning, replanning, and recovery

ReAct works for 3-5 step tasks. Production agents routinely need 20+ steps to complete a task — and any step can fail. This lesson is about how to keep a long-running agent from collapsing under its own context window.

Plan-then-execute

For tasks longer than ~5 steps, separate the planning phase from the execution phase:

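// Phase 1: Plan
plan = llm(planning_prompt, task, available_tools)   // returns a list of steps

// Phase 2: Execute
completed_steps = []
while plan is not empty:
    step = plan.pop_front()
    result = llm(execution_prompt, step, tools_for_this_step)
    if result.failed:
        // replan returns only the remaining work, so execution resumes at the failure point
        plan = replan(task, completed_steps, step, result.error)
    else:
        completed_steps.append(step)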

The planning model can be smaller/cheaper since it just produces structured output. The execution model is where you spend on quality.
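A runnable version of the plan-then-execute loop might look like the following. It is a sketch under stated assumptions: `make_plan`, `execute`, and `replan` are hypothetical callables wrapping your model calls, steps are plain strings, and the `max_replans` cap is an added safeguard not described in the text above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class StepResult:
    ok: bool
    output: str = ""
    error: str = ""


def run_plan(
    task: str,
    make_plan: Callable[[str], list],
    execute: Callable[[str], StepResult],
    replan: Callable[[str, list, str, str], list],
    max_replans: int = 2,
) -> list:
    """Execute a plan step by step, replanning the remainder after a failure."""
    pending = list(make_plan(task))
    completed = []
    replans = 0
    while pending:
        step = pending.pop(0)
        result = execute(step)
        if result.ok:
            completed.append(step)
            continue
        if replans >= max_replans:
            raise RuntimeError(f"giving up on {step!r}: {result.error}")
        replans += 1
        # The replanner sees what is already done, so it returns only remaining work.
        pending = list(replan(task, completed, step, result.error))
    return completed
```

Using a `while` loop over a mutable queue, rather than a `for` loop over the original plan, is what lets a replan actually change which steps run next.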

A reasonable planning prompt:

You are planning how to complete this task:
{task}

Available tools:
{tool_summaries}

Produce a JSON plan as a list of steps. Each step has:
  - id (1, 2, 3...)
  - description (one sentence)
  - depends_on (array of step ids that must complete first)
  - tools (array of tool names this step might use)
  - success_criteria (how to know this step is done)

Constraints:
- 3-10 steps total. If a task seems to need more, you may be over-decomposing.
- Steps should be parallelizable where possible (depends_on=[])
- Each step should be small enough that a focused executor can finish it without further planning.

Output JSON only.
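Planner output should be validated before anything executes. The sketch below assumes the planner returns the JSON shape requested in the prompt above; `load_plan` is a hypothetical helper that checks the fields, rejects unknown tools, and orders steps by `depends_on` using the standard library's `graphlib` (Python 3.9+).

```python
import json
from graphlib import CycleError, TopologicalSorter

REQUIRED_KEYS = {"id", "description", "depends_on", "tools", "success_criteria"}


def load_plan(raw: str, known_tools: set) -> list:
    """Parse planner output and return steps in a dependency-respecting order."""
    steps = json.loads(raw)
    if not isinstance(steps, list) or not 3 <= len(steps) <= 10:
        raise ValueError("plan must be a JSON list of 3-10 steps")
    by_id = {}
    for step in steps:
        missing = REQUIRED_KEYS - set(step)
        if missing:
            raise ValueError(f"step is missing keys: {sorted(missing)}")
        unknown = set(step["tools"]) - known_tools
        if unknown:
            raise ValueError(f"step {step['id']} names unknown tools: {sorted(unknown)}")
        if step["id"] in by_id:
            raise ValueError(f"duplicate step id: {step['id']}")
        by_id[step["id"]] = step
    for step in steps:
        dangling = set(step["depends_on"]) - set(by_id)
        if dangling:
            raise ValueError(f"step {step['id']} depends on unknown steps: {sorted(dangling)}")
    # TopologicalSorter takes a mapping of node -> predecessors.
    graph = {step["id"]: set(step["depends_on"]) for step in steps}
    try:
        order = list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        raise ValueError("plan has circular dependencies") from exc
    return [by_id[step_id] for step_id in order]
```

Rejecting a malformed plan here is cheap; discovering a circular dependency at step 7 of execution is not.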

The "scratchpad" pattern

Long-running agents accumulate context — every tool call response, every reasoning step. Past ~30 steps, the relevant signal is buried in noise. The fix: maintain a structured scratchpad.

{
  "task": "Find and refund the duplicate charge for customer X",
  "completed_steps": [
    { "id": 1, "summary": "Located customer record: id=cus_abc123" },
    { "id": 2, "summary": "Found duplicate charges: ch_1, ch_2 (both $99 on 2026-03-15)" }
  ],
  "current_findings": {
    "customer_id": "cus_abc123",
    "duplicate_charges": ["ch_1", "ch_2"],
    "original_charge": "ch_1",  // earlier timestamp
    "duplicate_to_refund": "ch_2"
  },
  "next_step": { "id": 3, "description": "Issue refund for ch_2" }
}

The agent gets this scratchpad instead of (or alongside) the full conversation history. Massive token savings. Better reasoning, because the relevant state is structured.

The scratchpad updates between steps via either (a) the agent emitting a structured update or (b) a separate "summarizer" model call. Option (b) is more reliable.
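A scratchpad like the one above can be held in a small data class. This is a minimal sketch: the field names mirror the JSON example, while the `Scratchpad` class and its `record` and `render` methods are hypothetical. Whatever produces the update (the agent itself or a summarizer call) would supply the arguments to `record`.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class Scratchpad:
    task: str
    completed_steps: list = field(default_factory=list)
    current_findings: dict = field(default_factory=dict)
    next_step: Optional[dict] = None

    def record(self, step_id: int, summary: str, findings: dict, next_step: Optional[dict]) -> None:
        """Fold one finished step into the scratchpad."""
        self.completed_steps.append({"id": step_id, "summary": summary})
        # Later findings overwrite earlier ones that share a key.
        self.current_findings.update(findings)
        self.next_step = next_step

    def render(self) -> str:
        """Serialize the state that gets placed in the next prompt."""
        return json.dumps(asdict(self), indent=2)
```

The output of `render()` is what goes into the next call's context, in place of the raw tool responses that produced it.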

Error recovery taxonomy

When a tool call fails, the agent needs to know how to react. Different errors warrant different responses:

| Error type | Right response |
| --- | --- |
| Transient (rate limit, timeout) | Retry with backoff (2-3 tries) |
| Invalid input | Reformulate the call, don't retry verbatim |
| Permission denied | Surface to user; don't escalate without authorization |
| Resource not found | Branch the plan — was the resource the goal, or a prerequisite? |
| Tool returned wrong data | Try alternative tool or replan |
| Unknown error | Halt and surface to a human |

Encode this in the error responses you give the agent. Don't just return raw error strings — return error type + actionable hint.

// Bad
return { "error": "API returned 429" }

// Good
return {
  "error_type": "transient",
  "message": "Rate limited; the API allows 60 req/min. The limit resets in 23 seconds.",
  "suggested_action": "wait_and_retry",
  "wait_seconds": 23
}
}
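For HTTP-backed tools, much of the taxonomy can be applied mechanically in the tool wrapper. The sketch below is one possible mapping, not a standard: `classify_http_error`, the status-code groupings, the `suggested_action` labels other than `wait_and_retry`, and the 5-second default wait are all assumptions chosen for illustration.

```python
from typing import Optional


def classify_http_error(status: int, retry_after: Optional[int] = None) -> dict:
    """Translate an HTTP status into a typed error the agent can act on."""
    if status in (408, 429, 502, 503, 504):
        return {
            "error_type": "transient",
            "message": f"Upstream returned {status}; the call may succeed if repeated.",
            "suggested_action": "wait_and_retry",
            # Prefer the server's hint when there is one; 5 seconds is an arbitrary default.
            "wait_seconds": retry_after if retry_after is not None else 5,
        }
    if status in (400, 422):
        return {
            "error_type": "invalid_input",
            "message": f"Upstream rejected the arguments ({status}).",
            "suggested_action": "reformulate_call",
        }
    if status in (401, 403):
        return {
            "error_type": "permission_denied",
            "message": f"Not authorized ({status}).",
            "suggested_action": "surface_to_user",
        }
    if status == 404:
        return {
            "error_type": "not_found",
            "message": "The requested resource does not exist.",
            "suggested_action": "branch_plan",
        }
    # Anything unrecognized is treated as unknown so the agent halts instead of guessing.
    return {
        "error_type": "unknown",
        "message": f"Unexpected status {status}.",
        "suggested_action": "halt_for_human",
    }
```

The "tool returned wrong data" row has no status code to key on; detecting it needs a content check specific to each tool.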

Termination conditions

Long agents need multiple safeguards against runaway execution:

All four trip independently. Whichever triggers first wins.

Human checkpoints

For high-consequence actions (financial transactions, account changes, sending external communications), put humans in the loop:

Match the friction to the consequence. Confirming every action is annoying; review-after for refunds over $1000 is appropriate.

Exercise: For an agent task you've built or are designing, list every step that could fail. For each, write what the error response should tell the agent. This is more useful than another prompt tweak.

LESSON 10.3

Multi-agent systems and orchestration

The next step up from a single agent: multiple agents that hand work to each other. This unlocks specialization (each agent has its own narrow prompt and toolset) and parallelism (agents work simultaneously). It also unlocks new failure modes. This lesson is about when to reach for multi-agent and how to keep it from spiraling.

When multi-agent is the right answer

A single agent with a long prompt and many tools can do almost anything. So when do you split into multiple agents?

Good reasons:

Bad reasons:

Common multi-agent patterns

1. Pipeline (sequential). Output of agent A feeds agent B feeds agent C. Each agent is specialized. Classic for content workflows: research → draft → edit → publish.

2. Manager-worker (orchestration). A "manager" agent decomposes the task and dispatches subtasks to "worker" agents. Workers don't know about each other; they just receive task → return result. The manager assembles.

3. Parallel + aggregator. Same task fanned out to multiple agents that process in different ways; results combined. Good for "draft 5 options" or "gather opinions from different perspectives."

4. Debate. Two agents argue opposing positions; a third evaluates. Useful for high-stakes decisions where you want explicit consideration of trade-offs.

Start with pipeline. It's by far the most common and most reliable.
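Two of the patterns above, pipeline and parallel + aggregator, reduce to very little orchestration code. In this sketch each "agent" is just a callable from string to string; `run_pipeline`, `fan_out`, and the `Agent` alias are hypothetical names, and a real system would wrap model calls where the example uses plain functions.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

Agent = Callable[[str], str]


def run_pipeline(stages: list, initial_input: str) -> tuple:
    """Feed each stage's output into the next and keep a per-stage trace.

    `stages` is a list of (name, agent) pairs.
    """
    trace = []
    payload = initial_input
    for name, agent in stages:
        output = agent(payload)
        # Recording input and output per hop shows which stage went wrong.
        trace.append({"stage": name, "input": payload, "output": output})
        payload = output
    return payload, trace


def fan_out(workers: list, task: str, aggregate: Callable[[list], str]) -> str:
    """Run every worker on the same task in parallel, then combine the results."""
    with ThreadPoolExecutor(max_workers=max(1, len(workers))) as pool:
        # pool.map preserves the order of `workers` in its results.
        results = list(pool.map(lambda worker: worker(task), workers))
    return aggregate(results)
```

The per-stage trace is the part worth keeping even in a toy version: without it, a bad final output gives no indication of which agent produced the problem.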

The communication problem

Multi-agent setups fail in characteristic ways at the agent-to-agent communication layer:

The fixes:

Orchestration frameworks: tools, not solutions

LangGraph, CrewAI, AutoGen, and Anthropic's Claude Agent SDK all give you scaffolding for multi-agent systems. They handle the loops, the state, and the tool dispatching. They don't:

Pick a framework based on your stack and the agent topology you need, not based on which has the best demos. Better yet: build single-agent systems until you genuinely hit the wall, then introduce multi-agent only where it pays off.

Evaluation: the hardest part

Multi-agent systems are stochastic at every hop. An eval that only asks "did the final output meet the spec" tells you that the system failed, not where it failed. Better:

When to ship vs. when to wait

Multi-agent is a rapidly evolving area. Frameworks are immature. Best practices are still forming. Costs are higher than they should be.

If you're shipping a product, lean toward single-agent + good tools. Multi-agent is the right call when:

For most production AI features in 2026, a well-designed single agent with 8-12 tools beats a 5-agent system. That ratio will shift as orchestration tools mature, but right now: simpler is better.

Exercise: Sketch a 3-agent pipeline (research → draft → edit) for a workflow you care about. Then sketch the same workflow as a single agent with all three sub-roles in the system prompt. Run both on the same 10 inputs. Compare quality, cost, latency, and debuggability. The single-agent version often wins on at least three of those.

Module 10 wrap-up

You now have the production agent toolkit: tool design that scales, planning that survives long tasks, error recovery with intent, and multi-agent orchestration with eyes open about the costs. The patterns in this module are the highest-leverage skills in current AI engineering — they're what separates "a demo that works" from "a system that runs."

The course ends here. From here, you build. The Projects section gives you three capstone projects to apply everything from Modules 1-10. Pick one, ship it, and you've turned this course into something you can put on your resume.