
2026 / Free online book · Technical Deep Dives
Multimodal in Practice
Building AI Systems That See, Hear, Read, and Still Fail in New Ways
Access: Free
Chapters: 16
Read time: 174 min
Adding images and audio to a system is not a parameter change. It is a new set of failure modes. This deep dive covers what multimodal models actually perceive, where they hallucinate differently, and the evaluation that catches it.
Vision and audio change the failure modes, not just the inputs. What breaks when the model has to see and hear.
This edition is free to read on the site. Each chapter has its own URL, so readers can bookmark, share, and return to the exact section they need.
Table of contents
FM Front Matter: Multimodal in Practice · 5 min
Building AI Systems That See, Hear, Read, and Still Fail in New Ways

INT Introduction: The Photo of the Bumper · 9 min
A team I will call the claims team (the details are composited from several real projects, but the shape is exact) built something that demoed beautifully.

01 The Demo Is Not the System · 11 min
> **Working claim:** A model that accepts an image, audio clip, or document is not a system that reliably perceives, extracts, or decides. The demo measures the model on inputs it chose.

02 Every Modality Brings Its Own Blind Spots · 12 min
> **Working claim:** There is no general "multimodal" capability you can reason about in the abstract.

03 A Vision Answer Is Not Evidence · 10 min
> **Working claim:** A model's description, reading, or transcription of a non-text input is a *claim*, not a fact.

04 Encoders, Contrastive Learning, and Cross-Modal Alignment · 11 min
> **Working claim:** You do not need to train a vision-language model to build with one, but you do need an accurate mental model of how machines come to associate an image with words.

05 Joint Embedding Spaces and Their Limits · 9 min
> **Working claim:** A shared embedding space is the best tool there is for finding candidates across modalities and a dangerous tool for deciding truth.

06 Documents Are Multimodal Objects · 9 min
> **Working claim:** A document is not text and not an image.

07 OCR Is Not Document Understanding · 9 min
> **Working claim:** Reading the characters on a page and understanding the document are different problems with different metrics, different failure modes, and different owners.

08 Grounding, Bounding Boxes, and Visual Citations · 10 min
> **Working claim:** An extracted value or visual claim that cannot be pointed back to a specific region of a source (a pixel box on a page, an audio offset, a video frame) is unverifiable and therefore unsafe to act on.

09 Audio Is Not Text After Transcription · 8 min
This chapter turns the claim that audio is not text after transcription into a concrete operating problem.

10 Video Is Not Just Frames · 9 min
This chapter turns the claim that video is not just frames into a concrete operating problem.

11 Multimodal RAG and Cross-Modal Search · 8 min
This chapter turns multimodal RAG and cross-modal search into a concrete operating problem.

12 Evaluation: Seeing Is Not Verifying · 9 min
> **Working claim:** A single accuracy number on a multimodal system is almost always a lie of omission.

13 Production Architecture for Multimodal Systems · 9 min
> **Working claim:** A multimodal system in production is mostly a pipeline for turning raw media into versioned, provenance-linked derived artifacts and keeping them consistent as models and extraction logic change.

14 Cost, Latency, and the Price of Vision Tokens · 8 min
> **Working claim:** Multimodal inputs are priced and billed differently from text in ways that surprise teams on the first invoice and again at scale. An image is not a flat add-on; its cost scales with resolution and tiling.

15 Privacy, Safety, and High-Stakes Images · 10 min
This chapter turns privacy, safety, and high-stakes images into a concrete operating problem.

16 The Use Case Field Guide · 10 min
> **Working claim:** The disciplines in this book (name the modality and its blind spot, define the object of truth, preserve derived artifacts with provenance, ground every claim, verify before acting, and evaluate by slice) are not abstract.

A Appendix A: Back Matter · 8 min
Glossary, implementation checklist, and source register for the book.
