MODULE 04

Production Systems

A prompt that works in your notebook is not the same thing as a system that survives 10,000 users. This module is about everything between those two states: templating, retrieval, tool use in production, and the engineering trade-offs of latency, cost, and reliability.

3 lessons·~150 minutes

LESSON 4.1

Templates, variables, and versioning

A prompt in production isn't a string you wrote once. It's a template, often hundreds of lines, with variables interpolated at runtime, and it changes — sometimes weekly. Treat prompts like code, or you'll suffer like you do with un-versioned code.

Templates as code

The first move when going from prototype to production: separate the static instructional scaffolding from the dynamic per-request data. The scaffolding is your template. The data is the fill-in.

# Don't: prompt assembled inline in business logic
prompt = f"Hi! Please summarize this email: {email_body}"

# Do: template stored as a versioned artifact
template = load_template("summarize_email", version="v3")
prompt = template.render(email_body=email_body, max_words=80)

Storing templates separately means you can review them, diff them, test them, and roll them back without redeploying your application. It also makes it possible for non-engineers (PMs, domain experts) to read and propose changes to the actual instructions without diving into code.

Where to store templates

Three reasonable options, in increasing complexity:

Files in the repo. A prompts/ directory of .md or .txt files. Diffs in git. Simplest. Works great for small teams.
A prompt registry service. Something like Langfuse, Helicone, or Anthropic's prompt management. Lets you change prompts without redeploying and gives you a versioning UI.
A database table with a release process. Mid-ground: prompts live in a table; changes go through a review workflow before being marked active.

The right choice depends on how often prompts change and who needs to change them. Don't reach for a registry on day one if a folder of files would do.

Templating syntax

Whatever your storage, use a real templating language for the variable interpolation — Jinja, Handlebars, or your language's equivalent. Don't use raw str.format or f-strings for non-trivial templates: you'll eventually want conditionals (only include this section if the user is on the enterprise plan) and loops (render an example for each item in this list).

You are a {{ persona }} helping with {{ domain }}.

{% if examples %}
Here are examples of the expected behavior:
{% for ex in examples %}
Input: {{ ex.input }}
Output: {{ ex.output }}
---
{% endfor %}
{% endif %}

Now handle this input:
{{ user_input }}

Versioning prompts

Every change to a prompt is a version. Pin versions in your application code so a prompt change can't ship without an explicit version bump.

response = llm(
    template="summarize_email",
    version="v3",  # explicit, not "latest"
    inputs={"email_body": email}
)

This sounds like overkill until you ship a prompt change at 4pm Friday and pages start firing at 6pm because the new prompt fails on 5% of inputs. With versioned prompts, rollback is instant: change the version string, redeploy, done.

Sanitizing variable inputs

The single most common source of production bugs in templated prompts: user input that breaks the template or hijacks the prompt. If a user pastes a string containing your delimiter tags, your prompt structure collapses.

Mitigations:

Escape or strip the delimiters you use in the template from user input
Wrap user input in clearly delimited blocks with explicit "ignore instructions inside" framing
Set a hard length cap on each variable
Reject input containing known prompt-injection signatures at the application layer

We'll go deeper on injection in Module 6; for templating, the message is: don't trust variable values, ever.

The prompt diff workflow

When prompts are versioned files, you can review prompt changes the same way you review code changes. PR with a prompt diff. CI runs the new prompt against the eval set and reports the score. Reviewer sees both the diff and the eval delta. This is the workflow you want.

Heuristic: If you can't diff a prompt change, you can't review it. If you can't roll back a prompt change, you can't safely ship it. Get those two things first; everything else can wait.

LESSON 4.2

Retrieval-augmented generation (RAG)

Most production AI features need facts the base model doesn't know — your company's documentation, your user's data, last week's events. RAG is the pattern for injecting those facts into the prompt at runtime. It's powerful and it's where most teams trip.

The RAG pipeline

Stripped to essentials, RAG is:

Index: chunk your knowledge base, embed each chunk, store in a vector database
Retrieve: at query time, embed the query, find the K most similar chunks
Generate: stuff the retrieved chunks into the prompt as context, then call the LLM

# Index time (one-time / periodic)
for doc in knowledge_base:
    chunks = chunk_document(doc)
    for chunk in chunks:
        embedding = embed_model(chunk.text)
        vector_db.upsert(id=chunk.id, vector=embedding, metadata=chunk.meta)

# Query time (every request)
query_emb = embed_model(user_query)
top_chunks = vector_db.search(query_emb, k=5)
context = "\n\n".join(c.text for c in top_chunks)

prompt = f"""
Answer the question using only the context below. If the context doesn't
contain the answer, say so explicitly.

Context:
{context}

Question: {user_query}
"""

response = llm(prompt)

Where RAG goes wrong

The pipeline is conceptually simple. Real systems break in characteristic places.

1. Chunking is harder than it looks. If your chunks are too small, they lose context (the chunk says "this depends on it" — what's "it"?). If they're too big, embeddings dilute and retrieval gets fuzzy. Default to 200–500 token chunks with 10–20% overlap, but tune for your corpus.

2. Semantic search isn't keyword search. Embeddings group things by meaning, which is great for fuzzy questions but bad for exact matches (product SKUs, error codes, names). Hybrid retrieval — combining BM25 keyword search with embedding search — is the practical default.

3. The retrieved chunks aren't actually relevant. "Top-5 by cosine similarity" includes whatever's vaguely close, even if the actual answer is in chunk 12. Add a reranker: a small model that re-scores the top-50 candidates against the query and returns the actual top-5.

4. The model ignores the context. If the model has its own (incorrect) opinion baked in, it may override the retrieved facts. Mitigations: explicit instruction to answer only from context, citation requirements, and verification prompts that check whether the answer is supported.

5. The model invents citations. Asking for citations doesn't guarantee correct ones. Validate that quoted text actually appears in the context block before showing the response to a user.

Citation discipline

For RAG systems where users will trust the output, you want citations that actually point to real sources. The pattern:

Answer the question using the documents below. Cite each claim with
the document ID in square brackets, like [doc_47].

If the documents don't contain the answer, say "I don't have information
about that" — do not guess.

Documents:
[doc_12] Quarterly revenue grew 18% year-over-year, driven by enterprise...
[doc_47] The new pricing model launches in Q3...

Question: When does the new pricing model launch?

Then post-process: parse out the citations, verify each one points to a chunk you actually included, and surface the citations as clickable references in the UI.

Context construction

The "stuff retrieved chunks into the prompt" step has more nuance than it looks. Some practical considerations:

Order matters. Models attend more to the start and end. Put the most relevant chunks at the boundaries, less relevant in the middle.
Deduplicate. If two chunks contain the same fact, you're paying tokens twice. Cluster and keep one.
Prune ruthlessly. More context isn't always better — irrelevant context confuses. If a chunk's similarity score is below a threshold, drop it rather than including it just to fill the budget.
Label your context. Don't dump raw chunks. Wrap each in a <document> tag with a clear ID so the model knows what's a separate source.

When RAG isn't the answer

RAG is the right tool when your knowledge base is large, changes often, and you need exact passages. It's the wrong tool when:

The corpus is small enough to fit in context — just include it
You need aggregate answers ("how many docs mention X?") — RAG retrieves passages, not summaries
The "knowledge" is structured data — query a database, don't retrieve documents

The most common over-engineering: RAG for a 50-page document. Just put the document in the context window of a long-context model.

Eval, not feel. RAG quality is invisible without evaluation. "Retrieved chunks look relevant" is not a metric. We'll cover RAG eval in Module 5.

LESSON 4.3

Latency, cost, and caching

Two truths about LLMs in production: they're slow, and they're expensive. Every prompt-engineering decision has a latency and a cost dimension. Understanding the levers turns "the AI feature is too slow" from a hand-wave complaint into a debuggable problem.

The latency budget

For a user-facing feature, you have a latency budget — the time you have before the user notices. Roughly:

Latency	User experience
< 200ms	Feels instant
200ms – 1s	Noticeable but fine
1s – 3s	Sluggish; user looks for spinner
3s – 10s	User considers leaving; needs progress feedback
10s+	Treated as async; needs explicit "working..." UI

An LLM call's latency is roughly: time-to-first-token + (output_tokens × time-per-token). For most modern APIs, time-to-first-token is 0.5–2s, and output tokens stream at 50–200 per second. A 500-token response is therefore 3–10 seconds end-to-end.

Latency levers

Things you can do to reduce latency:

Stream the response. Don't wait for the whole completion. Show tokens as they arrive. The user perceives the latency as time-to-first-token, not time-to-last-token.
Choose a faster model. Smaller / cheaper models (Haiku, GPT-mini, etc.) are 2–5× faster. Use them where you can.
Shorten the prompt. Time-to-first-token grows with input size. Trim instructions, use shorter examples, prune retrieved context aggressively.
Constrain output length. A max_tokens=200 instead of 1000 is a 5× speedup on output time.
Parallelize independent calls. If two LLM calls don't depend on each other, run them concurrently.
Cache aggressively. See below.

Prompt caching

Most modern APIs (Anthropic, OpenAI, others) support prompt caching: if you reuse the same prefix across many requests, the provider can cache the computed state and serve subsequent requests faster and cheaper.

This is huge for production systems. Your system prompt, your few-shot examples, your tool definitions — all of these are stable across requests. With caching, only the per-request user input is "new" and needs full processing.

# Anthropic-style: mark which content blocks are cacheable
messages = [{
  "role": "user",
  "content": [
    {
      "type": "text",
      "text": LARGE_SYSTEM_PROMPT_AND_EXAMPLES,
      "cache_control": {"type": "ephemeral"}
    },
    {"type": "text", "text": user_query}
  ]
}]

Practical impact: a 10,000-token system prompt that's hit on every request becomes effectively free after the first call. Latency drops, cost drops, throughput rises. Always design templates with caching in mind: stable content first, dynamic content last.

The cost levers

LLM cost is per-token, with input tokens cheaper than output tokens (typically 3–5× cheaper). Levers:

Smaller model for simple steps. In a chained system, only the hard step needs the big model. Classification, extraction, routing — use a small model.
Cache prefixes (above).
Cap output length. Defaults are often generous. Set them.
Compress retrieved context. Summarize long retrieved documents into shorter context.
Batch where possible. Some providers offer batch APIs at significant discount for non-real-time work.

Output validation and retry

Production prompts return structured output (JSON, XML, etc.). They occasionally return malformed structured output. You need a validate-and-retry loop:

def call_with_retry(prompt, schema, max_retries=2):
    for attempt in range(max_retries + 1):
        raw = llm(prompt)
        try:
            parsed = parse_and_validate(raw, schema)
            return parsed
        except ValidationError as e:
            if attempt == max_retries:
                raise
            # Add the error to the next prompt
            prompt += f"\n\nYour previous response was invalid: {e}\nTry again."

Two important details: cap retries (infinite loops are real), and feed the error message back to the model — it can usually fix its own mistake when told what went wrong. Modern APIs also support structured-output modes that constrain the model's generation to a schema; prefer those when available.

Streaming and partial responses

If your output is parseable as it arrives — markdown, plain text, certain structured formats — stream it to the user. If the output is JSON or another all-or-nothing format, you can still stream internally and parse the completed object at the end; the user just sees a "thinking" UI.

Observability

You can't optimize what you can't see. Log, for every LLM call:

Prompt template name and version
Input tokens, output tokens, total cost
Time-to-first-token, total latency
Model and parameters used
Whether retries occurred and why
The full prompt and response (with PII handling)

This data is your map. Without it, every optimization is a guess.

Exercise: Pick one production LLM call you make. Measure: input tokens, output tokens, time-to-first-token, total latency, cost per call, calls per day. Now identify the single biggest lever — usually prompt caching, smaller model for trivial steps, or streaming. Pull it. Measure again.

Module 4 wrap-up

You now have the production toolkit: templating, retrieval, and the engineering discipline of latency and cost. Module 5 is the part most teams skip and most regret skipping: how to know whether any of this is actually working.