MODULE 05

Evaluation & Iteration

"It feels better" is not a unit. This module is about turning prompt work from gut-feel into a measurement-driven discipline. Build evals once, and every prompt change after that becomes a confident yes-or-no.

3 lessons · ~120 minutes

LESSON 5.1

Building eval datasets that catch regressions

An eval dataset is a set of inputs paired with expected behavior, used to score prompt changes. Most teams either don't have one, have one that's too small, or have one that's too easy. This lesson is about building one that actually catches regressions.

What an eval is

An eval is three things:

A dataset of inputs (and, often, expected outputs or properties)
A grader that scores each output for that input
A metric that aggregates per-input scores into a single number

You run a prompt against the dataset, the grader scores each output, and you get a number. When you change the prompt, the number changes. You ship if the number went up and nothing important regressed.
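The three parts can be wired together in a few lines. This is a minimal sketch, not a reference implementation: `run_prompt` stands in for your real model call, and the dataset, labels, and function names are hypothetical placeholders.

```python
from typing import Callable

# Hypothetical toy dataset: each example pairs an input with an expected label.
DATASET = [
    {"input": "I was charged twice for May.", "expected": "billing"},
    {"input": "The app crashes on login.", "expected": "technical"},
    {"input": "Please add dark mode.", "expected": "feature_request"},
]

def run_prompt(text: str) -> str:
    """Stand-in for a real model call (assumption: it returns a label string)."""
    lowered = text.lower()
    if "charged" in lowered:
        return "billing"
    if "crash" in lowered:
        return "technical"
    return "other"

def exact_match_grader(output: str, example: dict) -> float:
    """Grader: 1.0 if the output equals the expected label, else 0.0."""
    return 1.0 if output.strip() == example["expected"] else 0.0

def pass_rate(scores: list[float]) -> float:
    """Metric: mean of the per-example scores."""
    return sum(scores) / len(scores) if scores else 0.0

def run_eval(model: Callable[[str], str], dataset: list[dict]) -> float:
    """Run the model on every input, grade each output, aggregate."""
    scores = [exact_match_grader(model(ex["input"]), ex) for ex in dataset]
    return pass_rate(scores)

score = run_eval(run_prompt, DATASET)
print(f"pass rate: {score:.2f}")
```

Swapping in a different prompt means swapping `run_prompt`; the dataset, grader, and metric stay fixed so the two numbers are comparable.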

Sources of eval inputs

Where do the inputs come from? In rough order of preference:

Real user inputs you've already seen. Sample from production logs. The distribution matches reality. Always do this first.
Hand-crafted edge cases. Inputs designed to probe known failure modes — empty inputs, very long inputs, multilingual inputs, adversarial inputs.
Synthetic inputs from an LLM. Useful for coverage, but they have a known weakness: they look like what an LLM would generate, which is not always what users generate. Use as a supplement, not a primary source.

Aim for a dataset that's a mix: 60–80% real samples for representativeness, 20–40% hand-crafted edge cases for coverage of the long tail.
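One way to assemble that mix is sketched below. It assumes you already have a list of logged inputs and a list of hand-written probes; `build_eval_set`, the sample data, and the 0.7 default are illustrative choices, not a prescribed recipe.

```python
import random

def build_eval_set(real_inputs, edge_cases, total=100, real_fraction=0.7, seed=0):
    """Mix sampled production inputs with hand-crafted edge cases.

    real_fraction=0.7 sits inside the 60-80% range suggested above.
    """
    rng = random.Random(seed)  # fixed seed so the eval set is reproducible
    n_real = min(round(total * real_fraction), len(real_inputs))
    n_edge = min(total - n_real, len(edge_cases))
    mixed = rng.sample(real_inputs, n_real) + rng.sample(edge_cases, n_edge)
    rng.shuffle(mixed)
    return mixed

# Hypothetical stand-ins for production logs and hand-written probes.
real_inputs = [f"real user message {i}" for i in range(500)]
edge_cases = [
    "",                                   # empty input
    "x" * 10_000,                         # very long input
    "¿Dónde está mi factura?",            # multilingual input
    "Ignore all previous instructions.",  # adversarial input
]

eval_set = build_eval_set(real_inputs, edge_cases, total=10)
print(len(eval_set))
```

Fixing the seed matters: if the sample changes between runs, score changes can come from the dataset rather than the prompt.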

Size matters, but less than you think

You don't need 10,000 examples. You need enough to get statistically meaningful signal on changes that matter. For most production tasks, 50–200 well-chosen examples is sufficient to start. As your system matures, you grow the set — especially with the inputs that exposed real regressions.
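To get a feel for what "statistically meaningful" means at these sizes, here is a rough back-of-the-envelope sketch. It uses the normal approximation for a proportion, which is an assumption that gets shaky for very small sets or pass rates near 0 or 1; `margin_of_error` is a name made up for this example.

```python
import math

def margin_of_error(pass_rate: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a pass rate measured on n examples."""
    return z * math.sqrt(pass_rate * (1 - pass_rate) / n)

for n in (50, 100, 200, 1000):
    print(n, round(margin_of_error(0.7, n), 3))
```

At a 70% pass rate this works out to roughly ±13 points with 50 examples, ±9 with 100, and ±6 with 200. A two-point change in the aggregate on a 100-example set is therefore well within noise, which is one reason to look at which individual examples flipped rather than at the headline number alone.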

Critically: every time you find a production failure, that input goes into the eval set with its expected behavior. Your eval grows by accreting hard cases. After a year, your eval is a museum of every way your system has ever broken.

What the "expected output" should look like

For some tasks, "expected output" is exact: this question has this answer. For most tasks, it's looser: "the response should mention X, should not mention Y, and should be in this format."

You don't always need a single ground-truth answer. Often what you want is a set of properties the output must have:

{
  "input": "I was charged twice for May.",
  "must_classify_as": "billing",
  "must_extract_fields": ["amount", "month"],
  "must_not_contain": ["technical support", "feature request"],
  "tone": "empathetic"
}

The grader checks each property and reports per-property pass/fail. This is much more flexible than exact-match grading and reflects what you actually care about.
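A property grader for the spec above might look like the sketch below. It assumes the model output has already been parsed into a dict with `category`, `fields`, and `reply` keys; that output schema is hypothetical, so adapt the lookups to whatever your system returns.

```python
def grade_properties(output: dict, spec: dict) -> dict:
    """Return per-property pass/fail for one model output.

    Only the mechanically checkable properties are graded here;
    "tone" is left for an LLM judge.
    """
    reply = output.get("reply", "").lower()
    results = {}
    if "must_classify_as" in spec:
        results["must_classify_as"] = output.get("category") == spec["must_classify_as"]
    if "must_extract_fields" in spec:
        extracted = output.get("fields", {})
        results["must_extract_fields"] = all(
            name in extracted for name in spec["must_extract_fields"]
        )
    if "must_not_contain" in spec:
        results["must_not_contain"] = not any(
            phrase.lower() in reply for phrase in spec["must_not_contain"]
        )
    return results

spec = {
    "input": "I was charged twice for May.",
    "must_classify_as": "billing",
    "must_extract_fields": ["amount", "month"],
    "must_not_contain": ["technical support", "feature request"],
    "tone": "empathetic",
}
# Hypothetical parsed model output.
output = {
    "category": "billing",
    "fields": {"amount": "unknown", "month": "May"},
    "reply": "Sorry about the double charge. I've flagged it for a refund.",
}
report = grade_properties(output, spec)
print(report)
```

Reporting each property separately tells you which requirement broke, which a single pass/fail bit cannot.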

Grader types

Three flavors of grader, in increasing complexity:

1. Exact match / regex. Cheapest, fastest, most reliable when applicable. Use it for classification, extraction, structured output where the answer is unambiguous.

2. Programmatic property checks. Custom code that asserts properties of the output (length, format validity, presence of required substrings, parseability). Good for structured tasks where you can articulate the rules.

3. LLM-as-judge. An LLM scores the output. Use when the right answer is fuzzy ("is this response empathetic?", "is this summary accurate?"). Powerful and dangerous — see Lesson 5.2.

Most production eval suites use a mix: regex graders for the cheap-to-check properties, programmatic graders for structural correctness, and LLM-as-judge for the qualitative dimensions.
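The first two grader types are plain code. The sketch below shows one of each; the function names and the label pattern are made up for illustration.

```python
import json
import re

def regex_grader(output: str, pattern: str) -> bool:
    """Type 1: pass if the whole output matches the pattern."""
    return re.fullmatch(pattern, output.strip()) is not None

def json_shape_grader(output: str, required_keys: set[str]) -> bool:
    """Type 2: pass if the output parses as a JSON object with the required keys."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed)

LABELS = r"billing|technical|feature_request"  # hypothetical label set

print(regex_grader("billing", LABELS))
print(regex_grader("Category: billing", LABELS))
print(json_shape_grader('{"category": "billing", "amount": 20}', {"category", "amount"}))
print(json_shape_grader("Sure! Here is the JSON you asked for.", {"category"}))
```

Using `re.fullmatch` rather than `re.search` is deliberate: an output that wraps the right label in extra chatter would break a downstream parser, so it should fail the grader too.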

The "split your eval" discipline

Just like ML training, you want a held-out set you don't iterate against. Otherwise you'll overfit your prompts to your eval.

Dev set: ~70% of your data. You iterate against this — every prompt change is scored on dev.
Test set: ~30% of your data. You only run this when you think you have a winner. If dev and test scores diverge significantly, you've overfit.
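One way to make the split stable as the dataset grows is to derive it from a hash of each example's ID. This is a sketch under the assumption that every example has a stable string ID; `assign_split` is a hypothetical helper.

```python
import hashlib

def assign_split(example_id: str, test_fraction: float = 0.3) -> str:
    """Assign an example to "dev" or "test" from a stable hash of its ID."""
    digest = hashlib.sha256(example_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # map to a value in [0, 1)
    return "test" if bucket < test_fraction else "dev"

ids = [f"example-{i}" for i in range(1000)]
splits = {example_id: assign_split(example_id) for example_id in ids}
test_share = sum(split == "test" for split in splits.values()) / len(ids)
print(f"test share: {test_share:.2f}")
```

Unlike a random shuffle, a hash-based split never moves an existing example between dev and test when you add new cases, so the test set stays untouched by earlier iteration.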

Diff-based eval scoring

The most useful number isn't the absolute score; it's the diff. When you change a prompt, you care about: which examples improved? Which got worse? Which are now newly broken?

      Before    After    Δ
Pass:   142      147     +5
Fail:    58       53     -5

Newly passing: 8
Newly failing: 3   ← INVESTIGATE THESE

Even if the aggregate score went up, the three newly-failing examples might be in your most important segment. The diff is the unit of analysis, not the overall percentage.
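Computing the diff only requires per-example results keyed by a stable ID. A minimal sketch, with made-up example IDs:

```python
def diff_results(before: dict[str, bool], after: dict[str, bool]) -> dict[str, list[str]]:
    """Compare two per-example pass/fail maps keyed by example ID."""
    shared = before.keys() & after.keys()
    return {
        "newly_passing": sorted(k for k in shared if not before[k] and after[k]),
        "newly_failing": sorted(k for k in shared if before[k] and not after[k]),
    }

before = {"ex1": True, "ex2": False, "ex3": True, "ex4": False}
after = {"ex1": True, "ex2": True, "ex3": False, "ex4": False}

diff = diff_results(before, after)
print(diff)
```

Here the aggregate is unchanged (two passes before, two after), yet one example broke. That is exactly the case an overall percentage hides.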

Heuristic: No prompt change ships without an eval delta. If you can't quantify the change, you don't know if it's an improvement.

LESSON 5.2

LLM-as-judge: when it works, when it lies

LLM-as-judge is the practice of using an LLM to score outputs from another LLM (or the same one). It's how you scale evaluation past what humans can grade. It's also how you build a system that lies to you about its own quality if you're not careful.

The basic pattern

You are evaluating an AI assistant's response.

Question: {question}
Response: {response}

Score the response from 1 to 5 on the following criteria:
- Accuracy: are the facts correct?
- Helpfulness: does it answer the question?
- Tone: is it appropriate for a professional context?

Respond as JSON: {"accuracy": N, "helpfulness": N, "tone": N, "reasoning": "..."}

Run this judge over your eval set, aggregate the scores, get a number. Repeat per prompt change.
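In code, the pattern is a template plus strict parsing of the reply. The sketch below uses `fake_judge` as a stand-in for a real LLM call, so the canned reply and helper names are assumptions. Validating the JSON matters because judges do sometimes return malformed or out-of-range scores.

```python
import json

JUDGE_TEMPLATE = """You are evaluating an AI assistant's response.

Question: {question}
Response: {response}

Score the response from 1 to 5 on the following criteria:
- Accuracy: are the facts correct?
- Helpfulness: does it answer the question?
- Tone: is it appropriate for a professional context?

Respond as JSON: {{"accuracy": N, "helpfulness": N, "tone": N, "reasoning": "..."}}"""

CRITERIA = ("accuracy", "helpfulness", "tone")

def parse_judge_reply(raw: str) -> dict:
    """Parse and validate the judge's JSON; raise ValueError if it is malformed."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("judge reply is not a JSON object")
    for name in CRITERIA:
        score = data.get(name)
        is_int = isinstance(score, int) and not isinstance(score, bool)
        if not is_int or not 1 <= score <= 5:
            raise ValueError(f"bad score for {name}: {score!r}")
    return data

def fake_judge(prompt: str) -> str:
    """Stand-in for a real LLM call; returns a canned reply."""
    return '{"accuracy": 4, "helpfulness": 5, "tone": 5, "reasoning": "Correct and direct."}'

prompt = JUDGE_TEMPLATE.format(question="What is 2 + 2?", response="4")
scores = parse_judge_reply(fake_judge(prompt))
print(scores["accuracy"], scores["helpfulness"], scores["tone"])
```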

Known failure modes of LLM judges

LLM judges are systematically biased in characteristic ways. Knowing the biases is the difference between using them well and being lied to by your own metrics.

Position bias. When asked to compare two responses (A vs. B), the judge tends to favor one position, often the first, regardless of content. Mitigation: always run pairs in both orders and average, or use absolute scoring per response.

Length bias. Longer responses score higher, even when they're verbose without being more correct. Mitigation: include length as an explicit dimension and penalize verbosity.

Self-preference. An LLM judge tends to prefer outputs from the same model family. If you're evaluating GPT outputs with a GPT judge, expect inflated scores. Mitigation: use a different model family as judge, or use multiple judges and look for agreement.

Sycophancy. Judges are trained to be agreeable. If your prompt to the judge implies what you want ("evaluate whether the response is helpful"), the judge skews toward "helpful." Mitigation: write judge prompts that present the criteria neutrally, with examples of high and low scores.

Surface-feature reliance. Judges grade things that look good — bullet points, headings, confident tone — even when content is wrong. Mitigation: include explicit criteria for substance over style, and pair LLM judges with programmatic graders that check facts.
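The position-bias mitigation is the easiest of these to mechanize. The sketch below judges each pair in both orders and only declares a winner when the two verdicts agree, which is a voting variant of "run both orders and average". `judge_pair` is a hypothetical callable that returns "first" or "second"; the two toy judges exist only to exercise the logic.

```python
from typing import Callable

def compare_both_orders(judge_pair: Callable[[str, str], str], a: str, b: str) -> str:
    """Return "a", "b", or "tie" after judging the pair in both orders."""
    forward = judge_pair(a, b)   # a is shown first
    backward = judge_pair(b, a)  # b is shown first
    a_wins = (forward == "first") + (backward == "second")
    if a_wins == 2:
        return "a"
    if a_wins == 0:
        return "b"
    return "tie"  # the verdict flipped with the order, so treat it as no preference

def always_first(first: str, second: str) -> str:
    """Toy judge with pure position bias."""
    return "first"

def prefers_longer(first: str, second: str) -> str:
    """Toy judge with pure length bias."""
    return "first" if len(first) >= len(second) else "second"

print(compare_both_orders(always_first, "response A", "response B"))
print(compare_both_orders(prefers_longer, "a much longer response", "short"))
```

The second toy judge shows the limit of this trick: swapping order cancels position bias, but a length-biased judge picks the longer response both times. Each bias needs its own mitigation.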

Designing a reliable judge prompt

Some practices that improve judge reliability:

Use a rubric, not vibes. Define each score level with concrete examples. "5 = answers the question completely with verifiable facts; 3 = partial answer with some unverifiable claims; 1 = wrong or off-topic."
Force reasoning before scoring. Have the judge explain its assessment first, then produce the score. As with chain-of-thought (CoT) prompting, the explanation conditions the score and reduces snap judgments.
Score one dimension at a time. A single multi-dimension judge call is cheaper but less reliable. If the dimensions matter independently, score them in separate calls.
Include the gold standard when you have one. "Compare the response to this reference answer" is more reliable than "judge this response on its own."
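These practices can be combined in a small prompt builder. The sketch below scores one dimension per call, embeds the rubric, asks for reasoning before the score, and includes a reference answer when one exists. Attaching the rubric text to a dimension called "accuracy" and the `SCORE: N` output convention are assumptions made for this example.

```python
import re
from typing import Optional

# Hypothetical rubric table: one entry per dimension you score.
RUBRICS = {
    "accuracy": (
        "5 = answers the question completely with verifiable facts; "
        "3 = partial answer with some unverifiable claims; "
        "1 = wrong or off-topic."
    ),
}

def build_judge_prompt(dimension: str, question: str, response: str,
                       reference: Optional[str] = None) -> str:
    """Build a single-dimension judge prompt with rubric and optional reference."""
    parts = [
        f"Judge the response on one dimension only: {dimension}.",
        f"Rubric: {RUBRICS[dimension]}",
        f"Question: {question}",
        f"Response: {response}",
    ]
    if reference is not None:
        parts.append(f"Reference answer: {reference}")
    parts.append(
        "First explain your assessment. Then end with a final line of the form 'SCORE: N'."
    )
    return "\n".join(parts)

def extract_score(judge_reply: str) -> int:
    """Pull the score out of a reply that reasons first and scores last."""
    match = re.search(r"^SCORE:\s*([1-5])\s*$", judge_reply, flags=re.MULTILINE)
    if match is None:
        raise ValueError("judge reply has no SCORE line")
    return int(match.group(1))

prompt = build_judge_prompt("accuracy", "What is 2 + 2?", "It is 4.", reference="4")
score = extract_score("The response matches the reference exactly.\nSCORE: 5")
print(score)
```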

Calibrate the judge

Before trusting a judge at scale, validate it on a sample where humans have already scored. The judge isn't ready until its scores correlate strongly with human scores on the same examples. If they don't, fix the rubric, not the data.

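# Calibration: do judge scores agree with human scores?
# load_human_scored_subset() and judge() are placeholders for your own code;
# the "human_score" key is an assumed field name.
from scipy.stats import spearmanr

items = load_human_scored_subset(n=50)
human_scores = [item["human_score"] for item in items]
judge_scores = [judge(item) for item in items]
correlation, p_value = spearmanr(human_scores, judge_scores)

# Aim for > 0.7 correlation before trusting at scale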

Hybrid grading

The strongest eval suites combine:

Programmatic graders for things that are objectively checkable
LLM judges for things that aren't, with strong rubrics and validation
Periodic human review on a small sample, to keep both honest

The human review is the anchor. If your LLM judge starts diverging from human judgment, your metrics are lying and you need to recalibrate before you ship anything based on them.
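A lightweight way to watch for that divergence is to track how often the judge lands close to the human score on each review sample. This is a sketch only: the one-point tolerance and the 0.8 threshold are illustrative values, not established standards, and the scores are made up.

```python
def agreement_rate(human: list[int], judge: list[int], tolerance: int = 1) -> float:
    """Share of examples where the judge is within `tolerance` points of the human."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty score lists of equal length")
    close = sum(abs(h - j) <= tolerance for h, j in zip(human, judge))
    return close / len(human)

# Hypothetical 1-5 scores from one periodic review sample.
human = [5, 4, 2, 1, 3, 4]
judge = [5, 5, 4, 1, 3, 2]

rate = agreement_rate(human, judge)
needs_recalibration = rate < 0.8  # illustrative threshold
print(f"agreement: {rate:.2f}, recalibrate: {needs_recalibration}")
```

Logging this rate per review cycle turns "the judge is drifting" from a hunch into a trend you can see.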

Trust, but verify. An LLM judge is a useful instrument, not a source of truth. Treat its scores like sensor readings — informative when calibrated, misleading when not.

LESSON 5.3

Failure-mode analysis and structured iteration

Once you have an eval, the question becomes: how do you iterate on prompts efficiently? The undisciplined answer is "tweak and re-run." The disciplined answer is to study failures, hypothesize causes, and target changes.

The iteration loop

The loop you want:

Run the prompt against the eval. Note the failures.
Categorize failures by type. Don't fix individual failures; fix categories.
Pick the largest category. Hypothesize a prompt change that would fix it.
Make the change. Re-run the eval.
Check: did the target category improve? Did anything else regress?
Keep the change if net-positive. Otherwise revert and try a different hypothesis.

Each pass through this loop is slower than a round of "tweak and pray," but the loop gets you to a better prompt much faster.
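The last two steps of the loop can be made explicit as a keep-or-revert check over failure counts per category. The rule below is one possible reading of "net-positive" (target improved, total failures did not rise); the category names and counts are made up.

```python
def should_keep(before: dict[str, int], after: dict[str, int], target: str) -> bool:
    """Keep the change only if the target category improved and
    total failures did not increase."""
    target_improved = after.get(target, 0) < before.get(target, 0)
    net_not_worse = sum(after.values()) <= sum(before.values())
    return target_improved and net_not_worse

# Hypothetical failure counts per category, before and after one prompt change.
before = {"format_drift": 12, "hallucination": 7, "over_refusal": 3}
after = {"format_drift": 4, "hallucination": 8, "over_refusal": 3}

keep = should_keep(before, after, target="format_drift")
regressed = sorted(c for c in after if after[c] > before.get(c, 0))
print(keep, regressed)
```

A change can pass this check and still deserve a second look: the categories listed in `regressed` are where to read individual outputs before shipping.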

Categorizing failures

A useful taxonomy of common LLM failure modes:

Category	What it looks like	Likely fix
Format drift	Output structure varies, breaks parsers	Tighter format spec, examples, output validation
Hallucination	Confident invented facts	Ground in retrieved context, require citations, "I don't know" instruction
Instruction lapse	Ignores a constraint that's in the prompt	Move constraint to system prompt; restate near task; add example
Over-refusal	Refuses benign requests	Specify what's in scope; add examples of compliant responses
Verbose / off-topic	Answers correctly but with fluff	Hard length cap; "respond in N sentences"; format constraint
Inconsistent tone	Style varies across outputs	Persona tightening; tone examples; explicit tone rules
Edge-case error	Specific input class fails	Add examples of that class; edge-case branching in prompt

When you scan a batch of failures, classify each into one of these (extend the taxonomy as needed). The category points directly at the kind of fix.
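Once each failure carries a category label, picking the next target is a counting exercise. A small sketch, assuming failures are labeled by hand or by a separate classifier; the IDs and labels are invented.

```python
from collections import Counter

# Snake-case names for the taxonomy rows above.
TAXONOMY = {
    "format_drift", "hallucination", "instruction_lapse", "over_refusal",
    "verbose", "inconsistent_tone", "edge_case",
}

# Hypothetical labeled failures from one eval run.
failures = [
    {"id": "ex07", "category": "format_drift"},
    {"id": "ex12", "category": "hallucination"},
    {"id": "ex19", "category": "format_drift"},
    {"id": "ex23", "category": "over_refusal"},
    {"id": "ex31", "category": "format_drift"},
    {"id": "ex40", "category": "hallucination"},
]

counts = Counter(f["category"] for f in failures)
unknown = set(counts) - TAXONOMY  # labels that suggest the taxonomy needs extending
target, size = counts.most_common(1)[0]
print(target, size, unknown)
```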

One change at a time

The single most important rule: change one thing per iteration. If you change three things at once and the eval improves, you don't know which change helped. If the eval regresses, you don't know which change hurt. Discipline yourself to atomic changes.

Don't fight the model

If a prompt change isn't fixing a failure category after 2–3 attempts, the prompt isn't the problem. Possibilities:

The task genuinely exceeds the model's capability — try a stronger model
The task doesn't fit in a single prompt, so decompose it into a chain (Module 3)
The model needs grounding it doesn't have — add retrieval (Module 4)
The eval is wrong — the "failure" is actually correct behavior with a bad expected answer

Knowing when to stop iterating on a prompt and reach for a different lever is itself a skill.

The "rewrite from scratch" reset

Prompts that have been iterated on many times accumulate cruft — instructions that addressed bugs that no longer exist, examples that fight each other, redundant constraints. Periodically, rewrite the prompt from scratch using only what you've learned about the task. Eval both. Often the cleaner rewrite scores better and is half the length.

Eval-driven development as culture

If you take one habit from this module: every prompt change goes through eval. No exceptions. Not "trust me, this is better." Not "we'll add it to the eval later." The eval is the spec; the prompt is the implementation; the diff is the proof.

Teams that internalize this ship faster, regress less, and onboard new engineers more easily — because the prompt's quality is visible and measurable, not folklore.

Exercise: Take your last 20 production failures (or a synthetic substitute). Categorize them using the taxonomy above. The largest category is your first target. Pick one change, make it, re-eval. Did the target category drop? What regressed?

Module 5 wrap-up

You now have the meta-skill that ties everything together: how to know if a prompt is working and how to make it better systematically. Module 6, the final module, covers the security and ethics dimensions. The patterns you've built work; Module 6 is about making sure they don't get exploited.