Text isn't the only input anymore. Modern models can read images, parse documents, transcribe audio, and reason about diagrams. This module covers the techniques that make multimodal prompts reliable — and the failure modes that bite when you treat images like text.
3 lessons · ~100 minutes
LESSON 9.1
How vision models actually see
The mental model from Module 1 — "the model predicts next tokens" — still applies, but the tokens have new sources. When you send an image alongside text, the image is converted into a sequence of image tokens via a vision encoder. Those tokens get fed into the same transformer that processes text tokens. The model attends across all of them together.
This matters for how you prompt. The model isn't "looking at" the image in any human sense. It's pattern-matching against the image's encoded representation, the same way it pattern-matches against text. The implications:
Things in the training data (common objects, common chart types, common document layouts) are recognized well
Things outside the training distribution (unusual layouts, niche specialized diagrams, low-quality scans) degrade ungracefully
Spatial reasoning ("how far apart are these two things?") is weak — the model knows positions roughly, not precisely
Reading small text, dense tables, or handwriting is where most models still struggle
Resolution and detail
Images consume tokens from the model's context budget, just as text does. Higher resolution = more tokens consumed. Some APIs expose two detail modes (names and exact costs vary by provider):
Low detail / fast: a small flat cost per image (around 85 tokens on some APIs), regardless of size. Fine for "what's the dominant color" or "is there a person in this photo."
High detail: hundreds to thousands of tokens, scales with image size. Needed for reading text, examining small details, complex diagrams.
Default to high detail when document accuracy matters. Use low detail for casual visual reasoning to save money and latency.
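To get a feel for the cost gap between the two modes, here is a rough estimator. It is a sketch only: the constants assume a tile-based scheme of the kind one provider has published (a flat 85 tokens in low detail; in high detail, 85 tokens plus 170 per 512-pixel tile after downscaling). Your provider's numbers may differ, so treat every constant as an assumption to verify.

```python
import math

# Assumed constants, modelled on one provider's published tile scheme.
# Check your own provider's documentation before relying on them.
LOW_DETAIL_TOKENS = 85     # flat cost in low-detail mode
BASE_TOKENS = 85           # fixed overhead in high-detail mode
TOKENS_PER_TILE = 170      # cost of each 512x512 tile
TILE_SIZE = 512
MAX_SIDE = 2048            # image is first fit inside a 2048x2048 square
TARGET_SHORT_SIDE = 768    # then the short side is reduced to 768 px


def estimate_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Rough token estimate for one image under the assumed tile scheme."""
    if detail == "low":
        return LOW_DETAIL_TOKENS

    # Step 1: shrink (never enlarge) so the image fits in the max square.
    scale = min(1.0, MAX_SIDE / max(width, height))
    w, h = width * scale, height * scale

    # Step 2: shrink (never enlarge) so the short side is at most 768 px.
    scale = min(1.0, TARGET_SHORT_SIDE / min(w, h))
    w, h = w * scale, h * scale

    # Step 3: count the tiles needed to cover the resized image.
    tiles = math.ceil(w / TILE_SIZE) * math.ceil(h / TILE_SIZE)
    return BASE_TOKENS + TOKENS_PER_TILE * tiles


if __name__ == "__main__":
    for size in [(512, 512), (1024, 1024), (2048, 4096)]:
        print(size, estimate_image_tokens(*size, detail="low"),
              estimate_image_tokens(*size, detail="high"))
```

Under these assumptions a 1024x1024 image costs 85 tokens in low detail and 765 in high detail, which is why the mode is worth choosing deliberately at high volume.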
The "describe what you see, then answer" trick
For complex visual questions, ask the model to describe the image first, then answer:
Look at this image and answer the user's question.
First, in <observation> tags, describe what you see in detail —
objects, layout, text content, any anomalies. Be thorough.
Then in <answer> tags, answer the user's question using your
observation.
Question: {user_question}
This forces the model to write the image content out as text tokens it can then reason over, instead of reasoning directly from image tokens, which it tends to do less well. It's the same pattern as chain-of-thought, applied to vision.
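In code, the pattern comes down to filling the template and then pulling the two tagged sections back out of the reply. The sketch below assumes the model's reply arrives as a plain string; the function names are illustrative, and the API call itself is left out because it depends on your provider.

```python
import re
from typing import Optional

PROMPT_TEMPLATE = """Look at this image and answer the user's question.
First, in <observation> tags, describe what you see in detail:
objects, layout, text content, any anomalies. Be thorough.
Then in <answer> tags, answer the user's question using your
observation.
Question: {user_question}"""


def build_prompt(user_question: str) -> str:
    """Fill the question into the describe-then-answer template."""
    return PROMPT_TEMPLATE.format(user_question=user_question)


def extract_tag(response: str, tag: str) -> Optional[str]:
    """Return the text inside the first <tag>...</tag> pair, or None."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", response, flags=re.DOTALL)
    return match.group(1).strip() if match else None


def parse_response(response: str) -> dict:
    """Split a reply into observation and answer, flagging malformed replies."""
    observation = extract_tag(response, "observation")
    answer = extract_tag(response, "answer")
    return {
        "observation": observation,
        "answer": answer,
        "well_formed": observation is not None and answer is not None,
    }
```

Keeping the observation around (rather than discarding it) is useful for debugging: when an answer is wrong, you can check whether the model misread the image or misreasoned from a correct description.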
What vision models reliably get right
Identifying objects, scenes, basic actions
Reading clear printed text (signs, captions, screenshots)
Describing layouts (left, right, top, foreground)
Counting small numbers of objects (≤ ~5)
Recognizing well-known places and brands (many hosted models decline to identify real people from their faces)
Identifying chart types and summarizing trends qualitatively
What they often get wrong
Reading dense or low-resolution text (forms, receipts)
Reading handwriting
Precise counting (more than ~10)
Precise spatial measurements ("the box is 2.3cm tall")
Tables with merged cells, footnotes, or non-standard layouts
Multi-page documents — performance degrades sharply past 5–10 pages
Math from images (especially handwritten equations)
Test before you trust. A vision model that's 95% accurate on common cases can be 30% accurate on your specific edge cases. Build an eval set with real samples from your domain.
LESSON 9.2
Document understanding and structured extraction
Pulling structured data out of unstructured documents — invoices, receipts, contracts, IDs, forms — is one of the highest-ROI uses of multimodal models. It's also where naive prompts fail spectacularly. This lesson is about the patterns that work.
The schema-first prompt
The single biggest improvement over "extract the data from this invoice" is providing an explicit output schema. The model now has a shape to fill in, not a vague task to interpret.
Extract the following fields from this invoice image. Output JSON
matching this exact schema. If a field cannot be found, set it to null.
{
  "invoice_number": string | null,
  "issue_date": string | null, // YYYY-MM-DD
  "due_date": string | null, // YYYY-MM-DD
  "vendor": {
    "name": string | null,
    "address": string | null,
    "tax_id": string | null
  },
  "line_items": [
    { "description": string, "quantity": number, "unit_price": number, "total": number }
  ],
  "subtotal": number | null,
  "tax": number | null,
  "total": number | null,
  "currency": string | null // ISO 4217 code
}
Rules:
- Use literal values from the document. Do not infer or compute beyond what's shown.
- Dates: parse from any visible format into YYYY-MM-DD.
- If the same field appears multiple times, prefer the most prominent occurrence.
- Numbers: strip currency symbols and thousands separators. Use the period as decimal.
[Image attached]
Compare to "extract the data": the schema version produces parseable, predictable output. The rules section handles the edge cases that always come up (date formats, repeated fields, number formatting).
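"Parseable, predictable output" is something you can check mechanically. The sketch below assumes the model returns the JSON as a raw string and tests it against the schema from the prompt; the function and constant names are illustrative, and a schema library would do the same job in a larger project.

```python
import json
import re
from datetime import date

NULLABLE_STR = (str, type(None))
NULLABLE_NUM = (int, float, type(None))

# Field name -> allowed Python types after json.loads (None means null).
SCALAR_FIELDS = {
    "invoice_number": NULLABLE_STR, "issue_date": NULLABLE_STR,
    "due_date": NULLABLE_STR, "currency": NULLABLE_STR,
    "subtotal": NULLABLE_NUM, "tax": NULLABLE_NUM, "total": NULLABLE_NUM,
}
VENDOR_FIELDS = ("name", "address", "tax_id")
LINE_ITEM_FIELDS = {
    "description": (str,), "quantity": (int, float),
    "unit_price": (int, float), "total": (int, float),
}


def _has_type(value, types) -> bool:
    # bool is a subclass of int in Python, so reject it explicitly.
    return isinstance(value, types) and not isinstance(value, bool)


def _is_iso_date(value: str) -> bool:
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", value):
        return False
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True


def check_invoice(raw: str) -> list:
    """Return a list of schema problems; an empty list means the shape is OK."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top level is not an object"]

    problems = []
    for name, types in SCALAR_FIELDS.items():
        if name not in data:
            problems.append(f"missing field: {name}")
        elif not _has_type(data[name], types):
            problems.append(f"wrong type: {name}")

    for name in ("issue_date", "due_date"):
        value = data.get(name)
        if isinstance(value, str) and not _is_iso_date(value):
            problems.append(f"not YYYY-MM-DD: {name}")

    vendor = data.get("vendor")
    if not isinstance(vendor, dict):
        problems.append("vendor is not an object")
    else:
        for name in VENDOR_FIELDS:
            if name not in vendor:
                problems.append(f"missing field: vendor.{name}")
            elif not _has_type(vendor[name], NULLABLE_STR):
                problems.append(f"wrong type: vendor.{name}")

    items = data.get("line_items")
    if not isinstance(items, list):
        problems.append("line_items is not a list")
    else:
        for i, item in enumerate(items):
            if not isinstance(item, dict):
                problems.append(f"line_items[{i}] is not an object")
                continue
            for name, types in LINE_ITEM_FIELDS.items():
                if not _has_type(item.get(name), types):
                    problems.append(f"line_items[{i}].{name} missing or wrong type")
    return problems
```

This only checks shape. Whether the values are actually right is a separate question, handled by validators and human review.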
Cite your sources
For high-stakes extraction (contracts, financial documents, anything legal), require the model to cite where it found each value:
For each extracted field, also include a "source" entry: a brief
description of where on the page the value came from
(e.g., "top right, line 'Invoice #INV-2024-088'").
This lets a human reviewer audit your output without re-reading
the whole document.
Citations dramatically reduce review time for human-in-the-loop workflows.
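If you ask for sources, it is worth checking that you actually got them. The sketch below assumes a hypothetical output shape in which each field is an object with "value" and "source" keys; that shape is an illustration, not something the prompt above guarantees, so adapt it to whatever format you request.

```python
def fields_missing_source(extracted: dict) -> list:
    """Names of fields that carry a value but no usable source description."""
    missing = []
    for name, entry in extracted.items():
        value = entry.get("value")
        source = (entry.get("source") or "").strip()
        # A null value needs no citation; a real value without one is suspect.
        if value is not None and not source:
            missing.append(name)
    return missing


if __name__ == "__main__":
    sample = {
        "invoice_number": {"value": "INV-2024-088",
                           "source": "top right, line 'Invoice #INV-2024-088'"},
        "total": {"value": 108.5, "source": ""},
        "due_date": {"value": None, "source": None},
    }
    print(fields_missing_source(sample))
```

Fields flagged this way are good candidates to surface first to the human reviewer.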
Multi-page documents: chunk and merge
Vision models handle 5–10 pages reasonably; beyond that, accuracy degrades. For long documents:
Split into pages or logical sections
Extract from each page independently
Merge results in code, handling collisions (e.g., totals appearing on multiple pages — usually take the last one)
This is "map-reduce" applied to documents — same as we used for long-text summarization in Module 3.
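The merge step is plain code, not a prompt. Here is one possible sketch, assuming each page was extracted into the invoice schema from earlier in this lesson and that pages arrive in order. The collision policy (the last page wins for totals, the first non-null value wins for everything else, disagreements are recorded) is an assumption you should tune to your documents.

```python
# Fields where a later page should override an earlier one.
LAST_WINS = {"subtotal", "tax", "total"}


def merge_pages(pages: list) -> dict:
    """Merge per-page extraction results (in page order) into one record."""
    merged = {"line_items": []}
    conflicts = []
    for page_number, page in enumerate(pages, start=1):
        for key, value in page.items():
            if key == "line_items":
                # Line items accumulate across pages.
                merged["line_items"].extend(value or [])
            elif value is None:
                # A null on this page tells us nothing; keep what we have.
                continue
            elif key not in merged or key in LAST_WINS:
                merged[key] = value
            elif merged[key] != value:
                # Two pages disagree on a field that should be unique.
                conflicts.append((key, page_number, value))
    merged["_conflicts"] = conflicts
    return merged
```

A non-empty "_conflicts" list is a natural trigger for human review, since it means two pages disagreed about a field that should appear once.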
OCR or vision-native?
Two paths exist for document understanding:
OCR-first: use a dedicated OCR service (AWS Textract, Google Document AI, Tesseract) to extract text, then prompt an LLM with the text
Vision-native: send the image directly to a vision-capable LLM (Claude, GPT-4o, Gemini)
| Use OCR-first when | Use vision-native when |
| --- | --- |
| Volume is very high (cost-sensitive) | Volume is moderate |
| Documents are uniform (one OCR template works) | Documents vary (different vendor invoice layouts) |
| You need exact character positions (bounding boxes) | Layout matters semantically (tables, hierarchies) |
| Text quality is poor (handwriting, scans) | Document includes diagrams or charts that need interpretation |
Modern vision-native models often come within 5-10% of dedicated OCR accuracy on printed text, and do dramatically better on layout-aware extraction. For most new projects, start vision-native.
Validation is not optional
Extracted data from images is the easiest place for silent corruption to enter your pipeline. A `0` that should have been `8`, a `.` instead of `,`, a missing zero — these slip past humans too. Build validators:
Internal consistency: subtotal + tax = total, line items sum to subtotal
Format validation: dates parseable, currency codes valid
Range checks: amounts within plausible ranges for your domain
Cross-document checks: if you've seen this vendor before, do their tax IDs match?
A document that fails validation should be routed to human review, not rejected outright. The cost of a false alarm (sending a legitimate document to a human) is far lower than the cost of a miss (accepting bad data into your system).
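A minimal sketch of the first three validator types, plus the routing rule, might look like this. It assumes the invoice has already been parsed into a dict matching the schema from earlier in the lesson; the currency list, the plausible-amount ceiling, and the rounding tolerance are placeholder assumptions to replace with values from your domain.

```python
from datetime import date

KNOWN_CURRENCIES = {"USD", "EUR", "GBP", "JPY"}  # illustrative subset only
MAX_PLAUSIBLE_TOTAL = 1_000_000                  # assumed domain ceiling
TOLERANCE = 0.005                                # half a cent of rounding slack


def validate_invoice(inv: dict) -> list:
    """Return the names of failed checks; an empty list means all passed."""
    failures = []
    subtotal, tax, total = inv.get("subtotal"), inv.get("tax"), inv.get("total")

    # Internal consistency: subtotal + tax = total.
    if subtotal is not None and tax is not None and total is not None:
        if abs(subtotal + tax - total) > TOLERANCE:
            failures.append("subtotal + tax != total")

    # Internal consistency: line items sum to subtotal.
    items = inv.get("line_items") or []
    if subtotal is not None and items:
        if abs(sum(item["total"] for item in items) - subtotal) > TOLERANCE:
            failures.append("line items != subtotal")

    # Format validation: dates parseable, currency code recognised.
    for field in ("issue_date", "due_date"):
        value = inv.get(field)
        if value is not None:
            try:
                date.fromisoformat(value)
            except (TypeError, ValueError):
                failures.append(f"unparseable date: {field}")
    currency = inv.get("currency")
    if currency is not None and currency not in KNOWN_CURRENCIES:
        failures.append("unknown currency")

    # Range check: amount within a plausible range.
    if total is not None and not (0 < total <= MAX_PLAUSIBLE_TOTAL):
        failures.append("total out of range")
    return failures


def route(inv: dict) -> str:
    """Failed validation goes to a human; it is never silently dropped."""
    return "human_review" if validate_invoice(inv) else "accept"
```

Cross-document checks are omitted here because they need a store of previously seen vendors, which depends on your system.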
Exercise: Run your extraction prompt on 20 real samples from your domain. Tabulate which fields fail most often. Add explicit handling for those failure modes to your prompt and validator.
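For the tabulation step, a few lines of code are enough. This sketch assumes you have hand-labelled the expected values for each sample and that fields are compared by exact equality; both the function name and that comparison rule are illustrative choices.

```python
from collections import Counter


def tabulate_field_failures(samples) -> list:
    """Rank fields by how often the prediction differs from the label.

    `samples` is an iterable of (predicted, expected) dict pairs.
    Returns (field, failure_count, failure_rate) tuples, worst first.
    """
    failures = Counter()
    total = 0
    for predicted, expected in samples:
        total += 1
        for field, truth in expected.items():
            if predicted.get(field) != truth:
                failures[field] += 1
    return [(field, count, count / total)
            for field, count in failures.most_common()]


if __name__ == "__main__":
    labelled = [
        ({"total": 10.0, "currency": "USD"}, {"total": 10.0, "currency": "USD"}),
        ({"total": 18.0, "currency": "USD"}, {"total": 10.0, "currency": "USD"}),
        ({"total": None, "currency": "EUR"}, {"total": 5.0, "currency": "USD"}),
    ]
    for field, count, rate in tabulate_field_failures(labelled):
        print(f"{field}: {count} failures ({rate:.0%})")
```

The fields at the top of this list are where extra prompt rules and validators will pay off first.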
LESSON 9.3
Charts, diagrams, screenshots, and other rich content
Beyond documents, multimodal models open up workflows that were previously hand-coded: parsing charts, debugging from screenshots, describing UIs, analyzing whiteboard photos. This lesson covers the patterns that scale.
Charts and data visualization
Vision models can read charts but with characteristic weaknesses. They reliably identify chart types (bar, line, pie, scatter) and the overall trend ("revenue grew, then plateaued"). They struggle to:
Read exact numeric values from axes
Distinguish closely-grouped data points
Identify which line is which when legend colors are similar
The right way to use them: ask for qualitative analysis, not precise numbers.
You are analyzing a business chart. Look at this image and answer:
1. What type of chart is it?
2. What variables are being shown? (x-axis, y-axis, series)
3. Describe the overall trend in 1-2 sentences
4. Identify any obvious outliers or anomalies
5. List 2-3 questions a stakeholder might want to investigate
6. Do NOT estimate exact numeric values; describe shapes and trends only.
If you need exact numbers, get them from the underlying data, not the chart.
Screenshot-to-action
One of the most powerful patterns: a user sends a screenshot of their broken UI, an error dialog, or a confusing form. The model identifies the issue and produces actionable steps.
The user is asking for help. They've sent a screenshot of their screen.
Your job:
1. Describe what the user is looking at (the UI, the screen, the error)
2. Identify the most likely problem they're trying to solve
3. Give 3 specific actions they can try, in order from easiest to most involved
4. If you can see an error message or code, quote it exactly and explain what it means
5. If you need more information to help, ask one specific clarifying question
Be concrete. "Click the gear icon at the top right" beats "go to settings."
User message: {message}
This pattern works for support agents, IT helpdesks, in-app help features, code review tools, and many more. The key is treating the screenshot as evidence, not as the question — the question is implicit in "I'm stuck."
UI design feedback from mockups
Send a design mockup and ask for usability feedback. The model has seen many UIs in training and can evaluate a mockup reasonably well against common usability heuristics.
You're a senior product designer reviewing this UI mockup.
Evaluate:
- Visual hierarchy: is the most important action clear?
- Accessibility: contrast, target sizes, label clarity
- Consistency: spacing, typography, component patterns
- Cognitive load: too many elements? Buried key actions?
Give specific, actionable feedback. Reference what you see — e.g.,
"the 'Save' button is the same size as 'Cancel', which is unusual
for a primary action."
Skip generic advice like "use white space." Be specific to this mockup.
This won't replace a real design review, but it catches obvious issues cheaply.
Whiteboard, sketch, and handwritten note OCR
Vision models can transcribe handwriting, but quality varies enormously by clarity. For whiteboards and sketches:
Best results: dark marker on clean whiteboard, good lighting, photo taken straight-on
Acceptable: paper notes in pen, clear handwriting, reasonable photo
Risky: shadows, glare, low contrast, mixed text + diagrams
For mission-critical handwriting recognition, fall back to dedicated handwriting OCR services. For "convert my brainstorm notes to a markdown outline," vision-native LLMs are great.
Audio and video
Frontier models increasingly accept audio input, though support varies by model and API (Gemini and GPT-4o's audio-capable variants are examples as of writing, so check your provider's current documentation). The patterns are similar to images:
Transcription works well for clear speech in common languages
Speaker identification ("who said what") is harder
Tone, emotion, and emphasis are partially detected but not reliably
Long audio (> 10 min) often needs chunking, like long documents
Video is the frontier. As of writing, models can analyze short clips (seconds to a minute), but long-form video understanding is expensive and inconsistent. For most use cases today, sample keyframes and analyze them as images.
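Both chunking long audio and sampling keyframes start with picking time offsets. The sketch below computes those offsets only; cutting the audio or grabbing the frames is left to a media tool such as ffmpeg. The 10-minute chunk size follows the rule of thumb above, while the 15-second overlap and 5-second frame interval are assumptions to tune.

```python
import math


def chunk_windows(duration_s: float, chunk_s: float = 600.0,
                  overlap_s: float = 15.0) -> list:
    """(start, end) windows covering the recording, each overlapping the last.

    The overlap keeps a sentence that straddles a boundary intact in at
    least one chunk.
    """
    if chunk_s <= overlap_s:
        raise ValueError("chunk_s must be larger than overlap_s")
    windows = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        windows.append((start, end))
        if end >= duration_s:
            break
        start = end - overlap_s
    return windows


def keyframe_times(duration_s: float, every_s: float = 5.0) -> list:
    """Timestamps (seconds) for sampled frames: 0, every_s, 2*every_s, ..."""
    if every_s <= 0:
        raise ValueError("every_s must be positive")
    count = math.ceil(duration_s / every_s)
    return [i * every_s for i in range(count)]


if __name__ == "__main__":
    print(chunk_windows(1500))     # a 25-minute recording
    print(keyframe_times(12))      # a 12-second clip
```

Each audio window can then be transcribed independently and stitched together, and each sampled frame can be analyzed with the image techniques from this module.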
When to use multimodal at all
Not every problem with images needs an LLM. Cheaper, more reliable alternatives often exist:
| Task | Better alternative |
| --- | --- |
| "Is there a face in this photo?" | Dedicated face detection (much cheaper) |
| "What objects are in this image?" (closed taxonomy) | Custom CV model or Rekognition / Cloud Vision |
| "Read this barcode/QR code" | Standard barcode library |
| "Is this image similar to that one?" | Image embeddings (CLIP) |
| "Read this exact form layout" | Templated OCR (Textract, Document AI) |
Reach for multimodal LLMs when the task is open-ended, the input varies a lot, or you need reasoning about content (not just detection).
Exercise: Pick a workflow at your job that involves manually reviewing images or documents. Mock up a vision-prompt solution. Test it on 20 real samples. The economics often work out shockingly well at $0.01-0.05 per call.
Module 9 wrap-up
Multimodal models extend prompt engineering into territory that used to require custom ML pipelines: visual classification, document parsing, screenshot understanding. The principles from earlier modules apply — explicit schemas, citation, validation, evals — but the failure modes are different. Test with your real data before trusting the output. In Module 10 we'll bring it all together into production agents that can take real actions.