MODULE 04

Production Systems

A prompt that works in your notebook is not the same thing as a system that survives 10,000 users. This module is about everything between those two states: templating, retrieval, tool use in production, and the engineering trade-offs of latency, cost, and reliability.

3 lessons·~150 minutes
LESSON 4.1

Templates, variables, and versioning

A prompt in production isn't a string you wrote once. It's a template, often hundreds of lines, with variables interpolated at runtime, and it changes — sometimes weekly. Treat prompts like code, or you'll suffer like you do with un-versioned code.

Templates as code

The first move when going from prototype to production: separate the static instructional scaffolding from the dynamic per-request data. The scaffolding is your template. The data is the fill-in.

# Don't: prompt assembled inline in business logic
prompt = f"Hi! Please summarize this email: {email_body}"

# Do: template stored as a versioned artifact
template = load_template("summarize_email", version="v3")
prompt = template.render(email_body=email_body, max_words=80)

Storing templates separately means you can review them, diff them, test them, and roll them back without redeploying your application. It also makes it possible for non-engineers (PMs, domain experts) to read and propose changes to the actual instructions without diving into code.

Where to store templates

Three reasonable options, in increasing complexity:

  1. Files in the repo. A prompts/ directory of .md or .txt files. Diffs in git. Simplest. Works great for small teams.
  2. A prompt registry service. Something like Langfuse, Helicone, or Anthropic's prompt management. Lets you change prompts without redeploying and gives you a versioning UI.
  3. A database table with a release process. Mid-ground: prompts live in a table; changes go through a review workflow before being marked active.

The right choice depends on how often prompts change and who needs to change them. Don't reach for a registry on day one if a folder of files would do.

Templating syntax

Whatever your storage, use a real templating language for the variable interpolation — Jinja, Handlebars, or your language's equivalent. Don't use raw str.format or f-strings for non-trivial templates: you'll eventually want conditionals (only include this section if the user is on the enterprise plan) and loops (render an example for each item in this list).

You are a {{ persona }} helping with {{ domain }}.

{% if examples %}
Here are examples of the expected behavior:
{% for ex in examples %}
Input: {{ ex.input }}
Output: {{ ex.output }}
---
{% endfor %}
{% endif %}

Now handle this input:
{{ user_input }}

Versioning prompts

Every change to a prompt is a version. Pin versions in your application code so a prompt change can't ship without an explicit version bump.

response = llm(
    template="summarize_email",
    version="v3",  # explicit, not "latest"
    inputs={"email_body": email}
)

This sounds like overkill until you ship a prompt change at 4pm Friday and pages start firing at 6pm because the new prompt fails on 5% of inputs. With versioned prompts, rollback is instant: change the version string, redeploy, done.

Sanitizing variable inputs

The single most common source of production bugs in templated prompts: user input that breaks the template or hijacks the prompt. If a user pastes a string containing your delimiter tags, your prompt structure collapses.

Mitigations:

We'll go deeper on injection in Module 6; for templating, the message is: don't trust variable values, ever.

The prompt diff workflow

When prompts are versioned files, you can review prompt changes the same way you review code changes. PR with a prompt diff. CI runs the new prompt against the eval set and reports the score. Reviewer sees both the diff and the eval delta. This is the workflow you want.

Heuristic: If you can't diff a prompt change, you can't review it. If you can't roll back a prompt change, you can't safely ship it. Get those two things first; everything else can wait.
LESSON 4.2

Retrieval-augmented generation (RAG)

Most production AI features need facts the base model doesn't know — your company's documentation, your user's data, last week's events. RAG is the pattern for injecting those facts into the prompt at runtime. It's powerful and it's where most teams trip.

The RAG pipeline

Stripped to essentials, RAG is:

  1. Index: chunk your knowledge base, embed each chunk, store in a vector database
  2. Retrieve: at query time, embed the query, find the K most similar chunks
  3. Generate: stuff the retrieved chunks into the prompt as context, then call the LLM
# Index time (one-time / periodic)
for doc in knowledge_base:
    chunks = chunk_document(doc)
    for chunk in chunks:
        embedding = embed_model(chunk.text)
        vector_db.upsert(id=chunk.id, vector=embedding, metadata=chunk.meta)

# Query time (every request)
query_emb = embed_model(user_query)
top_chunks = vector_db.search(query_emb, k=5)
context = "\n\n".join(c.text for c in top_chunks)

prompt = f"""
Answer the question using only the context below. If the context doesn't
contain the answer, say so explicitly.

Context:
{context}

Question: {user_query}
"""

response = llm(prompt)

Where RAG goes wrong

The pipeline is conceptually simple. Real systems break in characteristic places.

1. Chunking is harder than it looks. If your chunks are too small, they lose context (the chunk says "this depends on it" — what's "it"?). If they're too big, embeddings dilute and retrieval gets fuzzy. Default to 200–500 token chunks with 10–20% overlap, but tune for your corpus.

2. Semantic search isn't keyword search. Embeddings group things by meaning, which is great for fuzzy questions but bad for exact matches (product SKUs, error codes, names). Hybrid retrieval — combining BM25 keyword search with embedding search — is the practical default.

3. The retrieved chunks aren't actually relevant. "Top-5 by cosine similarity" includes whatever's vaguely close, even if the actual answer is in chunk 12. Add a reranker: a small model that re-scores the top-50 candidates against the query and returns the actual top-5.

4. The model ignores the context. If the model has its own (incorrect) opinion baked in, it may override the retrieved facts. Mitigations: explicit instruction to answer only from context, citation requirements, and verification prompts that check whether the answer is supported.

5. The model invents citations. Asking for citations doesn't guarantee correct ones. Validate that quoted text actually appears in the context block before showing the response to a user.

Citation discipline

For RAG systems where users will trust the output, you want citations that actually point to real sources. The pattern:

Answer the question using the documents below. Cite each claim with
the document ID in square brackets, like [doc_47].

If the documents don't contain the answer, say "I don't have information
about that" — do not guess.

Documents:
[doc_12] Quarterly revenue grew 18% year-over-year, driven by enterprise...
[doc_47] The new pricing model launches in Q3...

Question: When does the new pricing model launch?

Then post-process: parse out the citations, verify each one points to a chunk you actually included, and surface the citations as clickable references in the UI.

Context construction

The "stuff retrieved chunks into the prompt" step has more nuance than it looks. Some practical considerations:

When RAG isn't the answer

RAG is the right tool when your knowledge base is large, changes often, and you need exact passages. It's the wrong tool when:

The most common over-engineering: RAG for a 50-page document. Just put the document in the context window of a long-context model.

Eval, not feel. RAG quality is invisible without evaluation. "Retrieved chunks look relevant" is not a metric. We'll cover RAG eval in Module 5.
LESSON 4.3

Latency, cost, and caching

Two truths about LLMs in production: they're slow, and they're expensive. Every prompt-engineering decision has a latency and a cost dimension. Understanding the levers turns "the AI feature is too slow" from a hand-wave complaint into a debuggable problem.

The latency budget

For a user-facing feature, you have a latency budget — the time you have before the user notices. Roughly:

LatencyUser experience
< 200msFeels instant
200ms – 1sNoticeable but fine
1s – 3sSluggish; user looks for spinner
3s – 10sUser considers leaving; needs progress feedback
10s+Treated as async; needs explicit "working..." UI

An LLM call's latency is roughly: time-to-first-token + (output_tokens × time-per-token). For most modern APIs, time-to-first-token is 0.5–2s, and output tokens stream at 50–200 per second. A 500-token response is therefore 3–10 seconds end-to-end.

Latency levers

Things you can do to reduce latency:

Prompt caching

Most modern APIs (Anthropic, OpenAI, others) support prompt caching: if you reuse the same prefix across many requests, the provider can cache the computed state and serve subsequent requests faster and cheaper.

This is huge for production systems. Your system prompt, your few-shot examples, your tool definitions — all of these are stable across requests. With caching, only the per-request user input is "new" and needs full processing.

# Anthropic-style: mark which content blocks are cacheable
messages = [{
  "role": "user",
  "content": [
    {
      "type": "text",
      "text": LARGE_SYSTEM_PROMPT_AND_EXAMPLES,
      "cache_control": {"type": "ephemeral"}
    },
    {"type": "text", "text": user_query}
  ]
}]

Practical impact: a 10,000-token system prompt that's hit on every request becomes effectively free after the first call. Latency drops, cost drops, throughput rises. Always design templates with caching in mind: stable content first, dynamic content last.

The cost levers

LLM cost is per-token, with input tokens cheaper than output tokens (typically 3–5× cheaper). Levers:

Output validation and retry

Production prompts return structured output (JSON, XML, etc.). They occasionally return malformed structured output. You need a validate-and-retry loop:

def call_with_retry(prompt, schema, max_retries=2):
    for attempt in range(max_retries + 1):
        raw = llm(prompt)
        try:
            parsed = parse_and_validate(raw, schema)
            return parsed
        except ValidationError as e:
            if attempt == max_retries:
                raise
            # Add the error to the next prompt
            prompt += f"\n\nYour previous response was invalid: {e}\nTry again."

Two important details: cap retries (infinite loops are real), and feed the error message back to the model — it can usually fix its own mistake when told what went wrong. Modern APIs also support structured-output modes that constrain the model's generation to a schema; prefer those when available.

Streaming and partial responses

If your output is parseable as it arrives — markdown, plain text, certain structured formats — stream it to the user. If the output is JSON or another all-or-nothing format, you can still stream internally and parse the completed object at the end; the user just sees a "thinking" UI.

Observability

You can't optimize what you can't see. Log, for every LLM call:

This data is your map. Without it, every optimization is a guess.

Exercise: Pick one production LLM call you make. Measure: input tokens, output tokens, time-to-first-token, total latency, cost per call, calls per day. Now identify the single biggest lever — usually prompt caching, smaller model for trivial steps, or streaming. Pull it. Measure again.

Module 4 wrap-up

You now have the production toolkit: templating, retrieval, and the engineering discipline of latency and cost. Module 5 is the part most teams skip and most regret skipping: how to know whether any of this is actually working.