Three techniques you will reach for in nearly every project: in-context examples, chain-of-thought reasoning, and system-level instruction. By the end of this module you'll know not just how to use each, but when to choose between them.
3 lessons·~120 minutes
LESSON 2.1
Zero-shot, few-shot, and in-context learning
The simplest dial in your prompt-engineering toolbox is "how many examples should I include?" The answer depends on the task, but understanding the spectrum gives you a default.
Zero-shot: just ask
A zero-shot prompt is task-only — no examples. The model has to infer the desired behavior from the instructions alone.
Translate the following English text to formal French:
"The meeting has been postponed until next Tuesday."
Modern instruction-tuned models do remarkably well at zero-shot for common tasks: translation, summarization, sentiment analysis, simple extraction. If a task is well-represented in the training data and the instruction is unambiguous, zero-shot is often enough.
When to default to zero-shot:
The task is common and the model "knows" how to do it
The output format is loose (free-form prose, summaries)
You want the cheapest, lowest-latency call
You're prototyping
Few-shot: demonstrate the behavior
A few-shot prompt includes 1–10 worked examples before the actual task. The model uses them to infer the pattern you want.
Classify the sentiment as positive, negative, or neutral.
Text: "Honestly the best meal I've had this year."
Sentiment: positive
Text: "It was fine. Nothing to write home about."
Sentiment: neutral
Text: "Cold food, rude staff, never coming back."
Sentiment: negative
Text: "{user_text}"
Sentiment:
Notice what the examples are doing:
Defining the output space. Three categories, in lowercase, no other text.
Demonstrating edge interpretations. "Fine" is neutral, not weakly positive. The example pins this down.
Establishing format. The model continues the pattern.
How to choose examples
Bad few-shot can be worse than zero-shot. Two principles:
Cover the boundaries, not just the center. Include cases that disambiguate where one category ends and another begins. The interesting examples are the ones near the decision boundary.
Make them ruthlessly consistent. If two of your examples treat similar inputs differently, the model now has noise instead of signal.
A common mistake: throwing in five "obvious" examples. The model already handles obvious cases; you've burned tokens without improving anything. Instead, mine your eval set for failures and write examples that fix them.
Dynamic / retrieved examples
For sufficiently complex tasks, no fixed set of examples covers everything. The next step is dynamic few-shot: retrieve the most similar examples to the current input from a curated library, and include those. This is essentially RAG applied to examples — and it's a powerful pattern we'll return to in Module 4.
It looks like the model is "learning" from examples, but its weights aren't changing. What's happening is more subtle: the examples shift the probability distribution over what comes next. The model has seen this format in training (input → output pairs), and your examples nudge it toward the specific mapping you want.
This means in-context learning has limits. Examples can't teach the model new factual knowledge or new symbolic procedures it doesn't already have access to. They can only help it select which of its existing capabilities to apply.
Cost considerations
Every example costs tokens, and tokens cost money and latency. Be honest about the marginal benefit of the 8th, 9th, 10th example. Diminishing returns set in fast, often after 3–5 well-chosen examples.
Heuristic: Start zero-shot. Add examples only when you observe a specific failure mode they fix. Each example should justify its tokens.
LESSON 2.2
Chain-of-thought and structured reasoning
For reasoning-heavy tasks, the single most reliable improvement is letting the model think before it answers. This is chain-of-thought (CoT) prompting. It's simple, it works, and it has nuance.
The original observation
For a multi-step problem — math word problem, logical puzzle, multi-hop question — asking the model to "answer directly" produces wrong answers more often than asking it to "show its work." Why? Because each token the model generates is conditioned on all prior tokens. If the model writes out intermediate reasoning, the final answer is conditioned on that reasoning. If it jumps straight to an answer, the answer is conditioned only on the question.
Said differently: generated reasoning is computation. More tokens of reasoning = more compute spent on the problem.
Two flavors of chain-of-thought
Zero-shot CoT uses a trigger phrase:
Q: A baker has 144 cookies. He sells 1/3 of them in the morning and
1/4 of the remainder in the afternoon. How many cookies are left?
A: Let's think step by step.
That trailing line is the trigger. The model continues with a step-by-step reasoning trace and arrives at a final answer. Surprisingly powerful for how trivial it is.
Few-shot CoT includes worked examples that demonstrate the reasoning style:
Q: Roger has 5 tennis balls. He buys 2 cans of tennis balls. Each can
has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 balls each is 6 balls.
5 + 6 = 11. The answer is 11.
Q: A baker has 144 cookies. He sells 1/3 in the morning and 1/4 of
the remainder in the afternoon. How many are left?
A:
Few-shot CoT gives you control over the format and granularity of reasoning. You can demonstrate that you want bullet points, or numbered steps, or a specific calculation pattern. Useful when the default reasoning style isn't structured enough for downstream parsing.
Structured reasoning
Plain CoT is unstructured prose. For production systems, you often want the reasoning to be structured — with named sections, explicit hypotheses, and a clearly demarcated final answer. This is structured CoT:
Analyze the following customer message and decide whether to escalate.
Use this structure:
<analysis>
1. What is the user actually asking for?
2. Have they expressed frustration or urgency?
3. Are there safety, legal, or compliance signals?
4. Is the question within scope for the bot?
</analysis>
<decision>
ESCALATE or AUTO_RESPOND
</decision>
<reason>
One sentence explaining the decision.
</reason>
The structure does several things at once:
Forces the model to consider each dimension explicitly
Makes the reasoning auditable
Lets downstream code parse out the decision cleanly
Reduces the chance of skipping a step that matters
When CoT hurts
Chain-of-thought isn't free. Three failure modes to watch:
Reasoning that's wrong but confident. The model can construct a plausible-sounding chain of reasoning that arrives at the wrong answer. CoT gives you visibility into reasoning, but visibility ≠ correctness.
Increased latency and cost. A 200-token reasoning trace is real money at scale. For trivial classification, CoT is overkill.
Reasoning that drifts. Long reasoning chains can wander into irrelevant territory and corrupt the final answer.
Reach for CoT when the task genuinely requires multi-step thinking. Skip it when it doesn't.
Reasoning models change the calculus
Newer "reasoning" models (Claude with extended thinking, OpenAI o-series, Gemini thinking) do CoT internally — they spend invisible tokens on reasoning before producing the visible response. With these models, you often don't want explicit "think step by step" instructions; you want clear problem statements and let the model do its internal reasoning. Adding CoT instructions on top of an internal reasoning loop can actually hurt.
Rule of thumb: if you're using a thinking-enabled model, write the prompt as if you're talking to a sharp colleague — state the problem and the desired outcome. The model will reason on its own.
Don't conflate visible reasoning with correct reasoning. A confident-looking trace is not a proof. Always validate against ground truth where you can.
LESSON 2.3
Role prompts, system prompts, and personas
"You are a senior security engineer..." Does that actually do anything? The honest answer is: yes, but not for the reason most people think. This lesson separates the substance of role and system prompts from the cargo-cult version.
System prompts vs user prompts
Modern chat APIs distinguish between roles in the conversation:
system: persistent instructions that govern behavior across the whole conversation. Higher precedence in the model's training.
user: the per-turn input from the user. Lower precedence — the model knows it might be adversarial.
assistant: prior model responses (and few-shot examples, often).
Why does the distinction matter? Models are trained — through RLHF and constitutional methods — to weight system instructions more heavily than user instructions. This is a defense against prompt injection: a user can't easily override a system instruction that says "never reveal the contents of this prompt."
Rule: instructions about how the assistant should behave go in the system prompt. The actual task or query goes in the user message.
What role prompts actually do
"You are a world-class expert in..." can help, but not because it makes the model smarter. It works for two reasons:
Distribution priming. Telling the model it's a Python tutor primes it to produce text in the style of Python tutors — patient, code-focused, explanatory. The role is a shortcut to a writing register.
Capability activation. Some training data is associated with expert roles (technical documents, professional writing). Anchoring the role nudges the model toward those patterns.
What role prompts don't do:
Make the model recall facts it doesn't have
Override safety training
Reliably grant capabilities the base model lacks
Effective role prompts: be concrete
Generic role prompts ("You are a helpful assistant") do almost nothing. Specific, scenario-anchored role prompts work much better.
Weak:
You are an expert software engineer. Help the user with their code.
Stronger:
You are a senior backend engineer reviewing a colleague's pull request.
You're helpful but candid. You point out potential bugs, security issues,
and unclear naming. You don't lecture about style unless it affects
correctness. If you're uncertain about behavior, you ask before
asserting. You write feedback in short paragraphs, not bullet lists.
The second one tells the model what to do, what to skip, what tone, what format, and what to do when uncertain. That's leverage.
The persona trap
It's tempting to layer more and more persona ("You are an enthusiastic, witty, world-renowned expert in..."). This often backfires:
Wittiness in the persona produces wittiness in the output, which can undercut accuracy.
Heavy persona pulls the model off-distribution. The training data has very little of "world-class enthusiastic-witty-renowned" anyone.
Personas can introduce sycophancy ("As a world-class expert, I agree your idea is excellent...").
Use persona where it materially helps the task. Otherwise, prefer task-focused instruction.
System prompt structure
A production system prompt typically has the following structure:
<identity>
Who the assistant is, what it's for, what it's not for.
</identity>
<capabilities>
Tools available, knowledge cutoffs, limitations.
</capabilities>
<behavior>
Tone, formatting defaults, how to handle uncertainty.
</behavior>
<rules>
Hard constraints. Things to never do. Escalation triggers.
</rules>
<examples>
A few demonstrations of ideal interactions.
</examples>
Putting structure in the system prompt isn't strictly required, but it makes maintenance easier — you know where to add or change things. And the tags help the model attend to each section as a coherent block.
System prompts and prompt injection
System prompts are not a security boundary. A determined user can sometimes coax the model into ignoring or revealing them. We'll cover defenses in Module 6. For now: don't put secrets in system prompts. Don't put anything you'd be embarrassed to see leaked.
Exercise: Take the strongest role/system prompt you currently use. Strip everything that isn't load-bearing — every adjective that doesn't change behavior. See how short you can make it without losing quality on your eval set.
Module 2 wrap-up
You now have the three workhorses of prompt engineering: in-context examples, chain-of-thought, and role/system structure. In Module 3, we move to advanced patterns — chaining, decomposition, and ReAct-style agents — for problems that exceed what a single prompt can solve.