MODULE 09

Multimodal Prompting

Text isn't the only input anymore. Modern models can read images, parse documents, transcribe audio, and reason about diagrams. This module covers the techniques that make multimodal prompts reliable — and the failure modes that bite when you treat images like text.

3 lessons·~100 minutes
LESSON 9.1

How vision models actually see

The mental model from Module 1 — "the model predicts next tokens" — still applies, but the tokens have new sources. When you send an image alongside text, the image is converted into a sequence of image tokens via a vision encoder. Those tokens get fed into the same transformer that processes text tokens. The model attends across all of them together.

This matters for how you prompt. The model isn't "looking at" the image in any human sense. It's pattern-matching against the image's encoded representation, the same way it pattern-matches against text. The implications:

Resolution and detail

Vision models have a token budget for images, just like text. Higher resolution = more tokens consumed. APIs typically have two modes:

Default to high detail when document accuracy matters. Use low detail for casual visual reasoning to save money and latency.

The "describe what you see, then answer" trick

For complex visual questions, ask the model to describe the image first, then answer:

Look at this image and answer the user's question.

First, in <observation> tags, describe what you see in detail —
objects, layout, text content, any anomalies. Be thorough.

Then in <answer> tags, answer the user's question using your
observation.

Question: {user_question}

This forces the model to extract image content into tokens it can then reason over, instead of trying to reason directly from image tokens (which it does worse at). Same pattern as chain-of-thought, applied to vision.

What vision models reliably get right

What they often get wrong

Test before you trust. A vision model that's 95% accurate on common cases can be 30% accurate on your specific edge cases. Build an eval set with real samples from your domain.
LESSON 9.2

Document understanding and structured extraction

Pulling structured data out of unstructured documents — invoices, receipts, contracts, IDs, forms — is one of the highest-ROI uses of multimodal models. It's also where naive prompts fail spectacularly. This lesson is about the patterns that work.

The schema-first prompt

The single biggest improvement over "extract the data from this invoice" is providing an explicit output schema. The model now has a shape to fill in, not a vague task to interpret.

Extract the following fields from this invoice image. Output JSON
matching this exact schema. If a field cannot be found, set it to null.

{
  "invoice_number": string | null,
  "issue_date": string | null,  // YYYY-MM-DD
  "due_date": string | null,    // YYYY-MM-DD
  "vendor": {
    "name": string | null,
    "address": string | null,
    "tax_id": string | null
  },
  "line_items": [
    { "description": string, "quantity": number, "unit_price": number, "total": number }
  ],
  "subtotal": number | null,
  "tax": number | null,
  "total": number | null,
  "currency": string | null  // ISO 4217 code
}

Rules:
- Use literal values from the document. Do not infer or compute beyond what's shown.
- Dates: parse from any visible format into YYYY-MM-DD.
- If the same field appears multiple times, prefer the most prominent occurrence.
- Numbers: strip currency symbols and thousands separators. Use the period as decimal.

[Image attached]

Compare to "extract the data": the schema version produces parseable, predictable output. The rules section handles the edge cases that always come up (date formats, repeated fields, number formatting).

Cite your sources

For high-stakes extraction (contracts, financial documents, anything legal), require the model to cite where it found each value:

For each extracted field, also include a "source" entry: a brief
description of where on the page the value came from
(e.g., "top right, line 'Invoice #INV-2024-088'").

This lets a human reviewer audit your output without re-reading
the whole document.

Citations dramatically reduce review time for human-in-the-loop workflows.

Multi-page documents: chunk and merge

Vision models handle 5–10 pages reasonably; beyond that, accuracy degrades. For long documents:

  1. Split into pages or logical sections
  2. Extract from each page independently
  3. Merge results in code, handling collisions (e.g., totals appearing on multiple pages — usually take the last one)

This is "map-reduce" applied to documents — same as we used for long-text summarization in Module 3.

OCR or vision-native?

Two paths exist for document understanding:

Use OCR-first whenUse vision-native when
Volume is very high (cost-sensitive)Volume is moderate
Documents are uniform (one OCR template works)Documents vary (different vendor invoice layouts)
You need exact character positions (bounding boxes)Layout matters semantically (tables, hierarchies)
Text quality is poor (handwriting, scans)Document includes diagrams or charts that need interpretation

Modern vision-native models are often within 5-10% of dedicated OCR for printed text, and dramatically better for layout-aware extraction. For most new projects, start vision-native.

Validation is not optional

Extracted data from images is the easiest place for silent corruption to enter your pipeline. A `0` that should have been `8`, a `.` instead of `,`, a missing zero — these slip past humans too. Build validators:

Failed validation routes to human review, doesn't get rejected. The cost of false negatives (sending a legit doc to humans) is far lower than false positives (accepting bad data into your system).

Exercise: Run your extraction prompt on 20 real samples from your domain. Tabulate which fields fail most often. Add explicit handling for those failure modes to your prompt and validator.
LESSON 9.3

Charts, diagrams, screenshots, and other rich content

Beyond documents, multimodal models open up workflows that were previously hand-coded: parsing charts, debugging from screenshots, describing UIs, analyzing whiteboard photos. This lesson covers the patterns that scale.

Charts and data visualization

Vision models can read charts but with characteristic weaknesses. They reliably identify chart types (bar, line, pie, scatter) and the overall trend ("revenue grew, then plateaued"). They struggle to:

The right way to use them: ask for qualitative analysis, not precise numbers.

You are analyzing a business chart. Look at this image and answer:

1. What type of chart is it?
2. What variables are being shown? (x-axis, y-axis, series)
3. Describe the overall trend in 1-2 sentences
4. Identify any obvious outliers or anomalies
5. List 2-3 questions a stakeholder might want to investigate
6. Do NOT estimate exact numeric values; describe shapes and trends only.

If you need exact numbers, get them from the underlying data, not the chart.

Screenshot-to-action

One of the most powerful patterns: a user sends a screenshot of their broken UI, an error dialog, or a confusing form. The model identifies the issue and produces actionable steps.

The user is asking for help. They've sent a screenshot of their screen.

Your job:
1. Describe what the user is looking at (the UI, the screen, the error)
2. Identify the most likely problem they're trying to solve
3. Give 3 specific actions they can try, in order from easiest to most involved
4. If you can see an error message or code, quote it exactly and explain what it means
5. If you need more information to help, ask one specific clarifying question

Be concrete. "Click the gear icon at the top right" beats "go to settings."

User message: {message}

This pattern works for support agents, IT helpdesks, in-app help features, code review tools, and many more. The key is treating the screenshot as evidence, not as the question — the question is implicit in "I'm stuck."

UI design feedback from mockups

Send a design mockup and ask for usability feedback. The model has seen many UIs in training and reasonably evaluates against common heuristics.

You're a senior product designer reviewing this UI mockup.

Evaluate:
- Visual hierarchy: is the most important action clear?
- Accessibility: contrast, target sizes, label clarity
- Consistency: spacing, typography, component patterns
- Cognitive load: too many elements? Buried key actions?

Give specific, actionable feedback. Reference what you see — e.g.,
"the 'Save' button is the same size as 'Cancel', which is unusual
for a primary action."

Skip generic advice like "use white space." Be specific to this mockup.

Won't replace a real design review, but catches obvious issues cheaply.

Whiteboard, sketch, and handwritten note OCR

Vision models can transcribe handwriting, but quality varies enormously by clarity. For whiteboards and sketches:

For mission-critical handwriting recognition, fall back to dedicated handwriting OCR services. For "convert my brainstorm notes to a markdown outline," vision-native LLMs are great.

Audio and video

Modern frontier models increasingly support audio input (Gemini, GPT-4o realtime, Claude in some configurations). The patterns are similar to images:

Video is the frontier. As of writing, models can analyze short clips (seconds to a minute), but long-form video understanding is expensive and inconsistent. For most use cases today, sample keyframes and analyze them as images.

When to use multimodal at all

Not every problem with images needs an LLM. Cheaper, more reliable alternatives often exist:

TaskBetter alternative
"Is there a face in this photo?"Dedicated face detection (much cheaper)
"What objects are in this image?" (closed taxonomy)Custom CV model or Rekognition / Cloud Vision
"Read this barcode/QR code"Standard barcode library
"Is this image similar to that one?"Image embeddings (CLIP)
"Read this exact form layout"Templated OCR (Textract, Document AI)

Reach for multimodal LLMs when the task is open-ended, the input varies a lot, or you need reasoning about content (not just detection).

Exercise: Pick a workflow at your job that involves manually reviewing images or documents. Mock up a vision-prompt solution. Test it on 20 real samples. The economics often work out shockingly well at $0.01-0.05 per call.

Module 9 wrap-up

Multimodal models extend prompt engineering into territory that used to require custom ML pipelines: visual classification, document parsing, screenshot understanding. The principles from earlier modules apply — explicit schemas, citation, validation, evals — but the failure modes are different. Test with your real data before trusting the output. In Module 10 we'll bring it all together into production agents that can take real actions.