Multimodal Prompting
Time to See the Bigger Picture!
Let's craft the perfect prompt! Up until now, we've been communicating with AI using text alone. But the world isn't just words: it's photographs, charts, videos, audio recordings, and messy PDF documents from 2007 that someone really should have reformatted by now. In this chapter, you'll learn how to prompt AI systems that can take in images, audio, and documents as well as text.
Beyond Text: Why Multimodal Matters
For most of AI's history, language models could only process text. You typed words in, and you got words back. If you wanted an AI to analyze an image, you needed a completely separate computer vision system. If you wanted it to understand audio, you needed a different speech recognition system. Each modality — text, images, audio, video — lived in its own isolated world.
That era is over. Multimodal AI refers to artificial intelligence systems that can process, and in some cases generate, multiple types of data (text, images, audio, video, and documents) within a single unified model. Modern multimodal models like GPT-4o, Claude, and Gemini can look at a photograph and describe what they see or read a chart and explain its trends, and some can also listen to audio and provide a transcript. They combine capabilities that used to require an entire team of specialized systems.
Why does this matter for prompt engineering? Because the way you prompt a multimodal system is fundamentally different from prompting a text-only model. When you attach an image to your prompt, you're not just adding decoration — you're providing an entirely new source of information that the model must interpret, analyze, and integrate with your text instructions. The skill of writing effective multimodal prompts is quickly becoming one of the most valuable capabilities in the prompt engineer's toolkit.
flowchart LR
subgraph traditional["Traditional AI (separate systems)"]
direction LR
T1["Text"] --> LM["Language Model"] --> TR["Text Response"]
I1["Image"] --> VM["Vision Model"] --> LB["Labels"]
A1["Audio"] --> SM["Speech Model"] --> TS["Transcript"]
end
subgraph multimodal["Multimodal AI (unified system)"]
direction LR
T2["Text"] --> MM["Multimodal\nModel"]
I2["Image"] --> MM
A2["Audio"] --> MM
MM --> UR["Unified\nResponse"]
end
style traditional fill:#fff3e0,stroke:#e65100
style multimodal fill:#e8f5e9,stroke:#2e7d32
And here's the practical reality: most real-world information isn't pure text. Business reports have charts. Instruction manuals have diagrams. Social media posts have images. Medical records have scans. If your prompt engineering skills only work with text, you're leaving enormous value on the table.
Image Prompting Fundamentals
Image prompting is the practice of including images alongside text instructions to guide a multimodal AI system's analysis and response. This is the most common form of multimodal interaction, and it's where most people start their multimodal journey.
The basic pattern is straightforward: you attach an image and write a text prompt that tells the model what you want it to do with that image. But the quality of your text prompt dramatically affects the quality of the response. A vague prompt like "What's this?" produces a vague answer. A specific prompt like "Identify all safety hazards visible in this construction site photograph and categorize them by severity" produces actionable analysis.
Here are the key principles for effective image prompting:
- Be specific about what you want. Don't just ask "describe this image." Tell the model what aspects matter: composition, colors, objects, text, emotions, spatial relationships, or technical details.
- Provide context when helpful. If the image shows a medical scan, mention the type of scan and what you're looking for. If it's a product photo, explain the intended use case for the description.
- Set the output format. Just as with text-only prompts, specify whether you want a paragraph, bullet points, JSON, or a structured analysis.
- Ask focused questions. Multiple specific questions produce better results than one broad request.
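The four principles above can be folded into a small helper that assembles the text half of an image prompt. This is a minimal sketch, not any vendor's API: the function name `build_image_prompt`, its parameters, and the inspection context in the example are all made up for illustration, and the image itself would be attached separately using whatever mechanism your model provides.

```python
def build_image_prompt(task, context="", questions=(), output_format=""):
    """Assemble the text prompt that accompanies an attached image.

    All names here are illustrative; nothing in this function talks to a model.
    """
    parts = []
    if context:
        parts.append(f"Context: {context}")
    parts.append(f"Task: {task}")
    if questions:
        # Focused questions are numbered so each one gets its own answer.
        numbered = "\n".join(f"{i}. {q}" for i, q in enumerate(questions, start=1))
        parts.append("Answer each question separately:\n" + numbered)
    if output_format:
        parts.append(f"Output format: {output_format}")
    return "\n\n".join(parts)


prompt = build_image_prompt(
    task="Identify all safety hazards visible in this construction site photograph.",
    context="The photo was taken during a routine site inspection.",
    questions=[
        "Which hazards involve missing protective equipment?",
        "Which hazards involve unsecured materials or tools?",
    ],
    output_format="A bulleted list grouped by severity (high, medium, low).",
)
print(prompt)
```

Keeping context, task, questions, and format as separate arguments makes it easy to see at a glance which principle a weak prompt is missing.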
Think About What You're Really Asking
Words matter — let's get them right! When you send an image to a multimodal AI, the model doesn't "see" the image the way you do. It processes the image through a vision encoder that converts pixels into numerical representations. The text prompt you write determines what the model pays attention to in those representations. A good prompt is like pointing a flashlight in a dark room — it directs the model's attention exactly where it needs to go.
Image Description and Visual Question Answering
Two of the most fundamental multimodal tasks are image description and visual question answering, and understanding the difference between them will make you a more effective prompt engineer.
Image description is the task of generating a textual account of an image's content, composition, and visual characteristics. When you ask a model to "describe this image," you're requesting an image description. The model examines the visual content and produces text that captures what it sees — objects, people, settings, actions, colors, and spatial relationships.
Visual question answering (VQA) goes a step further. Instead of asking for a general description, you ask specific questions about the image's content. "How many people are sitting at the table?" "What color is the car in the background?" "Does this room appear to meet fire code requirements?" VQA is more targeted and typically more useful for practical applications.
| Task | Prompt Style | Best For |
|---|---|---|
| Image Description | "Describe what you see in this image" | Cataloging, accessibility, content moderation |
| Visual Question Answering | "Answer these specific questions about the image" | Analysis, inspection, research, fact-checking |
| Comparative Analysis | "Compare these two images and identify differences" | Quality control, before/after documentation |
| Spatial Reasoning | "Describe the layout and spatial relationships" | Architecture, interior design, navigation |
Here's a practical tip that surprises many beginners: when you need both a description and answers to specific questions, it's often better to make two separate requests rather than combining them. The model produces more thorough descriptions when focused solely on describing, and more accurate answers when focused solely on answering questions. This is the multimodal equivalent of the "one task per prompt" principle you learned in Chapter 4.
Document Analysis and PDF Processing
Not all images are photographs. Some of the most valuable multimodal use cases involve document analysis — using AI to extract, interpret, and summarize information from document images, scanned pages, and complex layouts that combine text, tables, and figures.
Document analysis is the process of using multimodal AI to extract structured information from images of documents, including scanned pages, photographed papers, forms, receipts, and similar materials. The model reads the text in the image, understands the layout, and interprets the relationships between different elements.
PDF processing extends this concept to handle PDF files, which may contain a mix of machine-readable text, embedded images, tables, charts, and complex formatting. Modern multimodal models can process PDF pages as images, extracting information that traditional text extraction tools often miss — especially from scanned documents, forms with handwriting, or pages with complex multi-column layouts.
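One way to make document extraction easier to check is to ask for a JSON object with a fixed set of keys and then parse the reply in code. The sketch below assumes a receipt with four hypothetical fields; the field names, both helper functions, and the sample reply are illustrative stand-ins, not real model output or a real API.

```python
import json

# Hypothetical field list for a receipt; adjust to your own documents.
FIELDS = ["vendor", "date", "total", "currency"]


def extraction_prompt(fields):
    """Build instructions asking for one JSON object with exactly these keys."""
    keys = ", ".join(f'"{name}"' for name in fields)
    return (
        "Extract the following fields from the attached document image. "
        f"Return a single JSON object with exactly these keys: {keys}. "
        "Use null for any field you cannot read with confidence."
    )


def parse_reply(reply_text, fields):
    """Parse the reply and list the fields that are missing or null."""
    data = json.loads(reply_text)
    unresolved = [name for name in fields if data.get(name) is None]
    return data, unresolved


# Hand-written stand-in for a model reply (not real model output).
sample_reply = (
    '{"vendor": "Acme Hardware", "date": "2024-03-05", '
    '"total": 42.5, "currency": null}'
)
data, unresolved = parse_reply(sample_reply, FIELDS)
print(unresolved)  # ['currency']
```

Inviting the model to return null instead of guessing gives you an explicit list of fields to check by hand.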
The power of multimodal document analysis becomes clear when you consider how much information lives trapped in "unstructured" formats. Every organization has filing cabinets (physical or digital) full of scanned documents, photographed whiteboards, legacy PDFs, and faxed forms. Multimodal AI can unlock much of that information with far less reliance on expensive specialized OCR software or manual data entry.
Diagram: Document Processing Pipeline
This diagram illustrates how different document types flow through a multimodal AI processing pipeline.
Type: Flowchart (left to right)
Nodes:
- Input Sources (left column, 4 boxes):
  - Scanned Paper Documents
  - Digital PDFs
  - Photographed Whiteboards
  - Screenshots
- Processing Stage (center column):
  - Multimodal AI Model (large central box)
  - Text prompt with extraction instructions (attached note)
- Output Types (right column, 4 boxes):
  - Structured Data (JSON/CSV)
  - Summary Reports
  - Extracted Tables
  - Searchable Text
Connections: Each input source connects to the central AI model. The AI model connects to each output type. A bidirectional arrow between the text prompt note and the AI model indicates the prompt guides processing.
Color scheme: Blue for input sources, orange for the AI processing stage, green for output types.
However, document analysis has important limitations to keep in mind. Models may struggle with very small text, unusual fonts, poor scan quality, or documents in languages they weren't heavily trained on. Always verify critical extracted information — especially numbers, dates, and proper nouns — against the original document. (An AI reading your tax form is helpful. An AI reading your tax form incorrectly is an audit waiting to happen.)
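Some of that verification can be automated with simple consistency checks before a human ever looks at the result. The following is a rough sketch for one hypothetical case, an invoice: the dictionary keys, the `check_invoice` function, and the sample data are assumptions made up for this example. Checks like these catch internal contradictions, not every misread, so they supplement a comparison against the original rather than replace it.

```python
from datetime import date


def check_invoice(extracted, tolerance=0.01):
    """Return a list of human-readable problems found in extracted invoice data."""
    problems = []

    # Line items should add up to the stated total.
    items_sum = sum(item["amount"] for item in extracted["line_items"])
    if abs(items_sum - extracted["total"]) > tolerance:
        problems.append(
            f"line items sum to {items_sum:.2f} but total reads {extracted['total']:.2f}"
        )

    # Dates should at least be valid calendar dates in ISO format.
    try:
        date.fromisoformat(extracted["invoice_date"])
    except ValueError:
        problems.append(f"invoice_date {extracted['invoice_date']!r} is not a valid date")

    return problems


# Hand-written example data: the total has a misread digit and the date is impossible.
extracted = {
    "invoice_date": "2024-02-30",
    "line_items": [{"amount": 120.00}, {"amount": 18.00}],
    "total": 128.00,
}
for problem in check_invoice(extracted):
    print(problem)
```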
Code Interpretation
Code interpretation is the use of multimodal AI to analyze, explain, or debug code that appears in images — such as screenshots of code editors, photographs of textbook examples, or images of error messages. This capability sits at the intersection of vision and programming knowledge.
You might wonder why anyone would send an image of code instead of copying and pasting the text. In practice, it happens constantly. A student photographs a textbook exercise. A developer screenshots an error that appears in a complex IDE layout. A colleague shares a screenshot of a code review comment thread. A legacy system displays code in a format that resists text selection.
Effective code interpretation prompts should specify:
- The programming language (if it's not obvious from the image)
- What kind of analysis you want (explanation, debugging, optimization, translation)
- The skill level of the intended audience for the explanation
- Whether you want the model to reproduce the code as text (extremely useful for working with screenshot-sourced code)
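When you do ask the model to reproduce code as text, a quick parse check can catch transcription damage such as a dropped bracket or a mangled indent. The sketch below assumes the transcribed code is Python and uses the standard library's `ast.parse`; the function name `syntax_problem` and both sample strings are hand-written for illustration. A successful parse only shows that the text is syntactically valid, not that every character matches the screenshot.

```python
import ast


def syntax_problem(transcribed_code):
    """Return None if the text parses as Python, else a short error description."""
    try:
        ast.parse(transcribed_code)
    except SyntaxError as err:
        return f"line {err.lineno}: {err.msg}"
    return None


# Hand-written examples standing in for text transcribed from a screenshot.
clean = "def area(w, h):\n    return w * h\n"
damaged = "def area(w, h:\n    return w * h\n"  # closing parenthesis lost

print(syntax_problem(clean))                # None
print(syntax_problem(damaged) is not None)  # True
```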
Chart Reading and Diagram Understanding
Two related but distinct skills round out the visual analysis toolkit: chart reading and diagram understanding. Both involve interpreting visual representations of information, but they require different prompting strategies.
Chart reading is the ability of a multimodal AI to interpret data visualizations — bar charts, line graphs, pie charts, scatter plots, and similar figures — and extract the data, trends, and insights they represent. When you send a chart to a multimodal model, it can identify axes, read labels, estimate values, detect trends, and even compare data series.
Diagram understanding is the ability to interpret non-data visual representations such as flowcharts, network diagrams, architecture diagrams, circuit schematics, and organizational charts. Diagrams convey relationships, processes, and structures rather than numerical data.
| Visual Type | What to Ask For | Example Prompt Fragment |
|---|---|---|
| Bar chart | Values, comparisons, rankings | "What are the top 3 categories by value?" |
| Line graph | Trends, inflection points, projections | "Describe the trend from Q1 to Q4" |
| Pie chart | Proportions, dominant segments | "What percentage does each segment represent?" |
| Flowchart | Process steps, decision points, paths | "Trace the path for a rejected application" |
| Network diagram | Connections, central nodes, clusters | "Which node has the most connections?" |
| Architecture diagram | Components, data flows, dependencies | "List all external service dependencies" |
For chart reading, one of the most useful techniques is asking the model to extract the underlying data. Rather than just describing what the chart shows, ask the model to reconstruct the data table. This gives you numbers you can verify, work with in spreadsheets, or use to create your own visualizations.
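If you ask for the reconstructed data as CSV, you can load it and run a basic sanity check in a few lines. This is a sketch under stated assumptions: the reply text is a hand-written stand-in for what a model might return for a pie chart, and the function names and the two-point tolerance are arbitrary choices for illustration.

```python
import csv
import io


def parse_chart_table(csv_text):
    """Parse a two-column CSV (label, value) into a list of (str, float) pairs."""
    reader = csv.reader(io.StringIO(csv_text.strip()))
    next(reader)  # skip the header row
    return [(label.strip(), float(value)) for label, value in reader]


def shares_look_complete(rows, tolerance=2.0):
    """For pie-chart data, check that the percentages add up to roughly 100."""
    return abs(sum(value for _, value in rows) - 100.0) <= tolerance


# Hand-written stand-in for a reply to "reconstruct the data as CSV".
reply = """
segment,percent
Online,46
Retail,31
Wholesale,18
Other,6
"""
rows = parse_chart_table(reply)
print(rows[0])                     # ('Online', 46.0)
print(shares_look_complete(rows))  # True
```

The tolerance allows for the fact that values read off a chart are estimates; a sum far from 100 suggests a segment was missed or misread.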
A Chart Is Worth a Thousand Numbers
Use your words! When asking AI to read charts, always mention the chart type in your prompt. Saying "this bar chart shows..." gives the model a head start on interpretation. Also, if you can see axis labels or a legend in the image, mention them — the model sometimes misreads small text, and your hint can prevent errors.
Screenshot Analysis
Screenshot analysis is the use of multimodal AI to interpret, evaluate, or extract information from screenshots of software interfaces, websites, dashboards, and digital applications. This is one of the most immediately practical multimodal skills because screenshots are the universal language of tech support, bug reports, and design feedback.
Think about how much time you spend describing software problems in words. "There's a button in the upper right corner that seems to overlap with the dropdown menu when I resize the window, and also the text below the search bar is cut off." Now imagine attaching a screenshot and saying: "Identify all UI issues visible in this screenshot." The model can see exactly what you see and respond with precision.
Common screenshot analysis use cases include:
- Bug reporting: "Identify all visual bugs or UI inconsistencies in this screenshot"
- Design review: "Evaluate this landing page design for visual hierarchy and readability"
- Accessibility audit: "Check this interface for potential accessibility issues"
- Competitive analysis: "Compare these two dashboard screenshots and list the differences in layout and features"
- Documentation: "Generate step-by-step instructions based on this sequence of screenshots"
The key to effective screenshot analysis is providing context about what the screenshot shows and what you're looking for. A screenshot of a mobile app's settings page looks very different to someone troubleshooting a bug versus someone conducting a design review. Your prompt should make your intent clear.
Audio Transcription and Video Understanding
Multimodal AI is not limited to text and images. Increasingly, models can process audio and video content as well, though these capabilities are at different stages of maturity.
Audio transcription is the conversion of spoken language in audio files into written text, potentially with speaker identification, timestamps, and annotation of non-speech sounds. While dedicated transcription services have existed for years, multimodal AI models can now transcribe audio and analyze the content simultaneously — summarizing meetings, extracting action items, identifying sentiment, and answering questions about what was said.
Video understanding takes this further by combining visual and audio analysis over time. A multimodal model processing video can track objects and people across frames, understand actions and events, read on-screen text, and integrate the audio track — producing a rich understanding of the video's content.
It's worth noting that audio and video capabilities vary significantly across models and change rapidly. Some models handle audio natively, while others require the audio to be transcribed first by a separate service. Some can process video directly, while others work with individual frames extracted from the video. Always check the current capabilities of your specific model before building workflows that depend on audio or video processing.
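For models that work with individual frames, you first have to decide which frames to send. A simple starting point is to sample evenly across the video, as in the sketch below. The function name and the midpoint sampling strategy are illustrative choices, and extracting the frames at these timestamps is left to whatever video tool you already use.

```python
def frame_timestamps(duration_s, max_frames):
    """Return evenly spaced timestamps (seconds), one at the midpoint of each segment."""
    if duration_s <= 0 or max_frames <= 0:
        return []
    step = duration_s / max_frames
    return [round(step * (i + 0.5), 2) for i in range(max_frames)]


print(frame_timestamps(60, 4))  # [7.5, 22.5, 37.5, 52.5]
```

Even sampling can miss short events, so for fast-moving content you may need more frames or a smarter selection method.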
Diagram: Multimodal Capability Maturity Levels
This diagram shows the relative maturity of different multimodal capabilities in current AI systems.
Type: Horizontal stacked bar chart or maturity matrix
Rows (modalities):
- Text Analysis: Mature (full bar, dark green)
- Image Description: Mature (full bar, dark green)
- Document/PDF Analysis: Advanced (90% bar, light green)
- Chart and Diagram Reading: Advanced (85% bar, light green)
- Screenshot Analysis: Advanced (85% bar, light green)
- Code Interpretation from Images: Intermediate (75% bar, yellow)
- Audio Transcription: Intermediate (70% bar, yellow)
- Image Generation: Intermediate (70% bar, yellow)
- Video Understanding: Early (50% bar, orange)
- Real-time Audio Conversation: Early (45% bar, orange)
Labels: Each bar labeled with the maturity level (Mature, Advanced, Intermediate, Early). A note at the bottom: "Capabilities as of early 2026 — this landscape changes rapidly."
Color scheme: Dark green for Mature, light green for Advanced, yellow for Intermediate, orange for Early.
Image Generation Prompts
So far, we've focused on using AI to understand visual content. But multimodal AI also works in the other direction: generating images from text descriptions. Image generation prompts are text instructions that direct an AI model to create a visual image — a photograph, illustration, diagram, or any other visual output.
Writing effective image generation prompts is its own art form. The field has developed a rich vocabulary for controlling image output, and understanding this vocabulary gives you much more control over results.
Key elements of an effective image generation prompt include:
- Subject: What is in the image? ("a golden retriever puppy," "a futuristic city skyline")
- Style: What artistic style? ("photorealistic," "watercolor painting," "flat vector illustration," "pencil sketch")
- Composition: How is the scene arranged? ("close-up portrait," "wide-angle landscape," "bird's eye view")
- Lighting: What mood does the lighting create? ("soft golden hour light," "dramatic chiaroscuro," "bright studio lighting")
- Color palette: What colors dominate? ("warm earth tones," "cool blues and grays," "vibrant neon")
- Details and modifiers: What additional qualities matter? ("highly detailed," "minimalist," "vintage 1970s aesthetic")
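The six elements above map naturally onto a small data structure. The sketch below is one possible way to keep them separate until you are ready to send the prompt; the `ImagePrompt` class and its field names are invented for illustration, and the comma-separated format is just one common convention, not a requirement of any particular image model.

```python
from dataclasses import dataclass


@dataclass
class ImagePrompt:
    subject: str
    style: str = ""
    composition: str = ""
    lighting: str = ""
    palette: str = ""
    modifiers: str = ""

    def render(self):
        """Join the non-empty elements into one comma-separated prompt."""
        parts = [self.subject, self.style, self.composition,
                 self.lighting, self.palette, self.modifiers]
        return ", ".join(p for p in parts if p)


prompt = ImagePrompt(
    subject="a golden retriever puppy",
    style="watercolor painting",
    composition="close-up portrait",
    lighting="soft golden hour light",
    palette="warm earth tones",
)
print(prompt.render())
```

Because each element is its own field, you can vary one at a time (say, swapping the style) and compare the results.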
A word of caution: image generation models can produce impressive results, but they also have well-documented biases and limitations. They may default to certain demographics, reinforce stereotypes, or struggle with specific subjects (ask any AI to draw hands and you'll see what I mean). Thoughtful prompt engineering can mitigate some of these issues, but awareness of the limitations is essential.
Alt Text Generation: Accessibility as a Superpower
One of the most socially valuable applications of multimodal AI is alt text generation. Alt text (alternative text) is a written description of an image that makes visual content accessible to people who use screen readers, have slow internet connections, or otherwise cannot see the image. Every image on the web should have alt text, but the vast majority don't — or have alt text so generic it's useless ("image.jpg" is not helpful to anyone).
Multimodal AI can generate high-quality alt text at scale, which is a genuine accessibility breakthrough. But generating good alt text requires specific prompting techniques because alt text has its own conventions and best practices.
Effective alt text should be:
- Concise: Usually 125 characters or fewer for simple images
- Descriptive: Convey the essential information the image communicates
- Contextual: Reflect the image's purpose in its specific context
- Non-redundant: Don't repeat information already in the surrounding text
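Two of these criteria, length and redundancy, are easy to check mechanically once the model has produced a candidate. The sketch below is a rough lint pass, not a full accessibility review: the function name, the filename pattern, and the exact wording of the issues are assumptions made for this example, and whether the text is descriptive and contextual still needs a human reader.

```python
import re

# Matches bare filenames such as "image.jpg" (illustrative pattern, not exhaustive).
FILENAME = re.compile(r"^\S+\.(jpe?g|png|gif|webp|svg)$", re.IGNORECASE)


def alt_text_issues(alt_text, surrounding_text="", max_chars=125):
    """Return a list of issues with a candidate alt text (empty list means none found)."""
    issues = []
    text = alt_text.strip()
    if not text:
        issues.append("empty")
    elif FILENAME.match(text):
        issues.append("looks like a filename")
    if len(text) > max_chars:
        issues.append(f"longer than {max_chars} characters")
    if text and text.lower() in surrounding_text.lower():
        issues.append("repeats the surrounding text")
    return issues


print(alt_text_issues("image.jpg"))  # ['looks like a filename']
print(alt_text_issues("Bar chart of quarterly revenue, rising each quarter"))  # []
```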
Accessibility Is Everyone's Job
Here's something worth celebrating: every time you use AI to generate good alt text, you're making the internet more accessible for millions of people. That's not just good prompt engineering — that's using technology to make the world genuinely better. Great prompt engineering really can be your superpower!
Data Visualization Prompts
The final piece of the multimodal puzzle brings us full circle: using prompts to create visual representations of data. A data visualization prompt is a text instruction that directs an AI to generate charts, graphs, plots, or other visual representations of data.
This capability connects several skills you've already learned. You need clear instructions (Chapter 2), structured output specifications (Chapter 6), and often data provided as context (Chapter 7 or Chapter 8). The multimodal element is the visual output — instead of getting text back, you get a chart.
Effective data visualization prompts specify:
- The data: Provide the actual numbers, a table, or a reference to a dataset
- The chart type: Bar chart, line graph, scatter plot, heatmap, etc.
- The visual design: Colors, labels, title, legend placement
- The story: What insight should the visualization communicate?
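To show how the four parts above fit together, here is a minimal sketch that embeds the data in the prompt as a pipe-delimited table. The function names are hypothetical and the revenue numbers are made up purely for illustration; the point is that the model receives the actual values rather than a vague reference to them.

```python
def to_markdown_table(headers, rows):
    """Render rows as a pipe-delimited table so the data travels inside the prompt."""
    lines = ["| " + " | ".join(headers) + " |",
             "|" + "---|" * len(headers)]
    lines += ["| " + " | ".join(str(cell) for cell in row) + " |" for row in rows]
    return "\n".join(lines)


def visualization_prompt(chart_type, headers, rows, design, story):
    """Combine data, chart type, visual design, and story into one prompt."""
    return "\n\n".join([
        f"Create a {chart_type} from the data below.",
        to_markdown_table(headers, rows),
        f"Design: {design}",
        f"The chart should make this point clear: {story}",
    ])


# Made-up numbers for illustration only.
prompt = visualization_prompt(
    chart_type="line graph",
    headers=["Quarter", "Revenue (USD millions)"],
    rows=[("Q1", 1.2), ("Q2", 1.5), ("Q3", 1.4), ("Q4", 1.9)],
    design="one line, labeled axes, title 'Quarterly Revenue', no legend",
    story="revenue dipped in Q3 but finished the year at its highest level",
)
print(prompt)
```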
Some models generate visualizations as images, while others generate code (Python with matplotlib or plotly, JavaScript with D3, etc.) that you then run to produce the visualization. Either approach benefits from detailed prompting. The more specific you are about the desired output, the less time you spend iterating.
Putting It All Together: A Multimodal Workflow
Let's walk through a realistic scenario that combines multiple multimodal skills. Imagine you're a market analyst preparing a quarterly report. Here's how multimodal prompting fits into your workflow:
- PDF Processing: Upload the competitor's 40-page quarterly earnings PDF. Prompt: "Extract all financial metrics from this report and organize them into a comparison table."
- Chart Reading: The PDF contains revenue trend charts. Prompt: "Extract the approximate quarterly revenue data from this chart for the past 8 quarters."
- Screenshot Analysis: Capture screenshots of the competitor's updated product page. Prompt: "Compare this screenshot with last quarter's version and identify all changes to pricing, features, and messaging."
- Data Visualization: Use the extracted data to create your own charts. Prompt: "Create a line chart comparing our revenue growth vs. competitor growth over the past 8 quarters."
- Image Generation: Create a professional header graphic for your report. Prompt: "Generate a clean, corporate-style header image for a quarterly competitive analysis report. Modern, minimalist design with blue and gray tones."
- Alt Text Generation: Ensure your report is accessible. Prompt: "Generate alt text for each chart and figure in this report."
This workflow would have required multiple specialized tools and significant manual effort just a few years ago. With multimodal prompting, a single analyst can accomplish all of it through well-crafted natural language instructions.
| Workflow Step | Multimodal Skill | Time Saved |
|---|---|---|
| PDF data extraction | Document Analysis | Hours of manual reading |
| Chart data extraction | Chart Reading | Tedious manual estimation |
| Competitive UI comparison | Screenshot Analysis | Side-by-side squinting |
| Visualization creation | Data Visualization Prompts | Design tool wrestling |
| Report graphics | Image Generation | Stock photo hunting |
| Accessibility compliance | Alt Text Generation | Description writing for every figure |
Best Practices and Common Pitfalls
After covering the individual multimodal skills, let's consolidate the wisdom into actionable best practices.
Best practice 1: Always provide context. When uploading an image, tell the model what it's looking at and why. "This is an X-ray of a patient's left knee, taken after a fall" gives the model far more to work with than an image uploaded with no explanation.
Best practice 2: Specify the output format. Multimodal responses can be verbose. If you want a structured analysis, say so. If you want a brief summary, say so. The model doesn't know your preferred format unless you tell it.
Best practice 3: Verify critical information. Multimodal models can misread text in images, miscount objects, misidentify people, and confuse spatial relationships. For any high-stakes application, treat the model's output as a first draft that requires human verification.
Best practice 4: Use high-quality inputs. A blurry photograph, a low-resolution screenshot, or a poorly scanned document will produce lower quality results. When possible, provide the highest quality visual input available.
Best practice 5: Break complex tasks into steps. Rather than asking the model to analyze a 50-page PDF in one go, process it page by page or section by section. This produces more thorough and accurate results.
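Splitting a long document into page ranges is simple to automate. The helper below is a small sketch; the function name and the batch size of 8 are arbitrary choices for illustration, and the right batch size depends on your model and your documents.

```python
def page_batches(total_pages, batch_size):
    """Yield (first_page, last_page) ranges, 1-indexed and inclusive."""
    for start in range(1, total_pages + 1, batch_size):
        yield start, min(start + batch_size - 1, total_pages)


batches = list(page_batches(total_pages=50, batch_size=8))
print(batches[0], batches[-1])  # (1, 8) (49, 50)
print(len(batches))             # 7
```

Each range can then be sent as its own request with the same extraction prompt, and the results merged afterward.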
Not Everything That Looks Right Is Right
Here's a trap that catches even experienced users: multimodal AI models can be confidently wrong about visual content. A model might read "128" when the chart says "138," or describe a green object as blue. The fluency of the response makes errors easy to miss. Always verify numbers, names, and critical details against the original visual source. Trust, but verify — especially when numbers are involved.
Common pitfalls to avoid:
- Sending images without text prompts. Attaching an image with no instructions forces the model to guess what you want. Always include explicit directions.
- Expecting pixel-perfect accuracy. Models estimate values from charts; they don't read them with the precision of data extraction software.
- Ignoring image resolution. Tiny text in a large image may be unreadable. Crop or zoom to the relevant section.
- Overlooking privacy. Images may contain personal information, faces, license plates, or other sensitive data. Be mindful of what you upload to AI services.
- Assuming all models are equal. Multimodal capabilities vary dramatically across models and versions. Test your specific use case with your specific model.
You Can See the Future — and It's Multimodal!
Use your words — and your images, charts, documents, and videos! You've now mastered the art of communicating with AI beyond text. These multimodal skills open up use cases that were simply impossible just a couple of years ago. Whether you're making the web more accessible, extracting data from legacy documents, or creating stunning visualizations, you've got the prompting skills to make it happen. Onward!
Key Takeaways
- Multimodal AI processes multiple data types in one model. Modern systems handle text, images, audio, video, and documents together, often removing the need for a separate specialized tool for each modality.
- Image prompts need specific instructions. Attaching an image without clear text directions produces vague results. Always tell the model what to analyze, what format to use, and what level of detail you need.
- Visual question answering outperforms generic description. Asking targeted questions about an image produces more useful responses than requesting a general description, especially for practical applications.
- Document and PDF analysis unlocks trapped information. Multimodal models can extract structured data from scanned documents, forms, and complex layouts that resist traditional text extraction.
- Chart reading requires verification. Models estimate data values from visualizations — they don't extract exact numbers. Always verify critical figures against source data.
- Screenshot analysis accelerates software workflows. Bug reporting, design review, and competitive analysis all benefit enormously from AI-powered screenshot interpretation.
- Alt text generation is an accessibility superpower. Using AI to generate descriptive alt text at scale makes visual content accessible to millions of people who rely on screen readers.
- Image generation prompts benefit from rich vocabulary. Specifying subject, style, composition, lighting, and color palette gives you much more control over generated images.
Concepts
- Multimodal AI
- Image Prompting
- Image Description
- Visual Question Answering
- Document Analysis
- PDF Processing
- Code Interpretation
- Chart Reading
- Diagram Understanding
- Screenshot Analysis
- Audio Transcription
- Video Understanding
- Image Generation Prompts
- Alt Text Generation
- Data Visualization Prompt