Alpesh Nakrani

Front Matter: Multimodal in Practice

Building AI Systems That See, Hear, Read, and Still Fail in New Ways

Read this alongside the Multimodal book, the AI-Native thesis, and the full book library when you want the surrounding argument.

Book promise

A model that can accept an image is not a reliable visual system. A model that can transcribe audio is not a meeting-understanding system. A model that can read a chart is not a data analyst. Multimodality expands capability, and in the same motion it expands ambiguity, attack surface, evaluation complexity, latency, cost, and the number of distinct ways a product can be wrong.

This is a practical, systems-minded guide to building production multimodal AI: systems that ingest images, scanned documents, screenshots, audio calls, video, charts, and mixed media, and that have to keep working when users upload a blurry photo, a screenshot of a PDF, a cropped form, an accented call, or a video where the relevant moment is two seconds long. It is written for builders who have already seen a visual-reasoning or speech-to-text demo work beautifully on clean inputs and now have to make it survive contact with real data.

This manuscript is not a short brief, not a topic outline, and not a marketing summary. It is for AI engineers, product engineers, MLOps engineers, document-AI teams, support-automation teams, data engineers, and technical founders who need to understand how vision-language and audio systems actually behave, where the "the model can just see it" reflex breaks, and how to build ingestion, grounding, evaluation, storage, and review as separate, governed subsystems.

It is not a computer vision textbook, not a speech recognition course, not a vendor-specific API manual, and not a benchmark leaderboard book. It draws on CLIP, vision-language models, OCR, speech recognition, multimodal benchmarks, and safety work, and represents them accurately, but it explains them as engineering building blocks, not as research objects.

The core thesis

Multimodal systems do not just add new inputs. They add new ways to be wrong.

The recurring motif

Every modality brings its own blind spots.

Text can be ambiguous. Images can hide tiny details in shadow or resolution. Audio can mishear an accent or a noisy room. Video can lose temporal causality between frames. Documents mix layout, typography, tables, stamps, signatures, checkboxes, and scanned noise into a single object that is not "an image" and not "text." A multimodal system has to preserve, evaluate, and monitor those differences rather than flatten them into one pipeline.

The second motif, used throughout

A vision answer is not evidence until it is verified.

A fluent description of an invoice, a confident reading of a chart, a clean transcript of a call: these are claims, not facts. The work of this book is largely the work of turning model claims into verified, grounded, auditable results. That means tying every extracted value back to a pixel region, a page, a timestamp, or a structured re-extraction that can disagree with the model and win.
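A minimal sketch of that idea, under stated assumptions: the `Claim` and `SourceRegion` types, the `verify` function, and the three status strings are hypothetical names chosen for illustration, not a schema the book prescribes. The point it shows is that a model's value is only accepted when an independent re-extraction of the cited region agrees with it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SourceRegion:
    """Where in the source a value came from (hypothetical schema)."""
    page: int
    box: tuple[float, float, float, float]  # x0, y0, x1, y1 in page pixels


@dataclass(frozen=True)
class Claim:
    """A value the model says it saw, plus the region it points at."""
    field: str
    value: str
    region: Optional[SourceRegion]


def normalize(text: str) -> str:
    """Drop whitespace, thousands separators, and case before comparing."""
    return "".join(text.split()).replace(",", "").lower()


def verify(claim: Claim, reextracted: Optional[str]) -> str:
    """Return 'verified', 'conflict', or 'unverified' for one claim.

    `reextracted` is the value an independent pass (for example OCR over
    the cited region) produced, or None if that pass found nothing.
    """
    if claim.region is None or reextracted is None:
        return "unverified"  # nothing to check the claim against
    if normalize(claim.value) == normalize(reextracted):
        return "verified"
    return "conflict"  # the re-extraction disagrees; the claim does not win


def accepted_value(claim: Claim, reextracted: Optional[str]) -> Optional[str]:
    """Only a verified claim yields a value; everything else goes to review."""
    return claim.value if verify(claim, reextracted) == "verified" else None
```

In this sketch a claim with no region is never verified, however fluent it is, and a disagreement blocks the model's value instead of being averaged away. A real system would likely need field-specific comparison rules (dates, currencies, names) in place of the single `normalize` helper.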

The enemy

The belief this book exists to correct:

"Multimodal AI is just text prompting with extra inputs. The model can see the image, so we don't need OCR, layout parsing, grounding, slice evals, or human review anymore."

A model that accepts an image accepts a down-sampled, tokenized version of that image, reasons over it probabilistically, and produces fluent text whether or not the relevant detail survived. That is genuinely useful. It is not a measurement instrument, not a document parser of record, not a transcription guarantee, and not a substitute for verifying what the system claims it saw.
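To make "whether or not the relevant detail survived" concrete, here is a small arithmetic sketch. It assumes the image is simply scaled so its longer side fits a fixed limit; the 1024-pixel limit used in the usage note is a hypothetical value for illustration, not the behavior of any particular model, and real preprocessing (tiling, cropping, patching) varies.

```python
def points_to_pixels(points: float, dpi: float) -> float:
    """Convert a font size in points to pixels (1 point = 1/72 inch)."""
    return points / 72.0 * dpi


def glyph_height_after_resize(
    width_px: int, height_px: int, glyph_px: float, max_side_px: int
) -> float:
    """Glyph height after the image is scaled down to fit max_side_px.

    Assumes plain proportional downscaling and never upscales.
    """
    scale = min(1.0, max_side_px / max(width_px, height_px))
    return glyph_px * scale
```

For an A4 page scanned at 300 DPI (about 2480 x 3508 pixels), 8-point text is about 33 pixels tall. Scaled to fit a 1024-pixel longer side, `glyph_height_after_resize(2480, 3508, points_to_pixels(8, 300), 1024)` gives just under 10 pixels. Fine print, stamps, and subscripts on the same page start smaller and end up smaller still.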

Primary research references

These anchor the book. Individual chapters use their own chapter-specific sources; this is the shared spine.

The MODAL Framework

One framework recurs through the book. Whenever a multimodal feature is on the table, ask five questions:

  • M: Modality. What kind of signal is this really: text, image, audio, video, scanned document, chart, screenshot, or a mix? What are its specific blind spots?
  • O: Object of truth. What exactly must be answered, extracted, classified, located, or acted on? A description is not an object of truth; a field value, a bounding box, a timestamp, or a decision is.
  • D: Derived artifacts. What representations will the system create (transcript, OCR text, page-image crop, caption, embedding, table JSON, frame index), and which of them is the system actually reasoning over?
  • A: Alignment. How is every derived artifact tied back to its source location: page and pixel box, audio offset, video frame, file hash? Without alignment there is no citation and no audit.
  • L: Limits. Which modality-specific errors (perception, OCR, grounding, temporal, reasoning, policy) must be measured before launch, sliced by lighting, language, accent, device, template, and angle?
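The L question asks for errors measured per slice, not only in aggregate. A minimal sketch of why, assuming each evaluation record is a plain dict with a boolean `correct` key plus slice attributes; the function name, record shape, and toy values are illustrative assumptions, not results from any real evaluation.

```python
from collections import defaultdict


def error_rate_by_slice(records, slice_key):
    """Return {slice value: error rate} for one slicing attribute."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for record in records:
        slice_value = record[slice_key]
        totals[slice_value] += 1
        if not record["correct"]:
            errors[slice_value] += 1
    return {s: errors[s] / totals[s] for s in totals}


# Toy data, made up for illustration: 8 well-lit photos, 2 low-light photos.
records = (
    [{"lighting": "good", "correct": True}] * 8
    + [{"lighting": "low", "correct": False}] * 2
)
overall = sum(not r["correct"] for r in records) / len(records)
by_lighting = error_rate_by_slice(records, "lighting")
```

On this toy data the overall error rate is 0.2, which can look acceptable, while the low-light slice fails every time (`by_lighting["low"]` is 1.0). The same function can be run once per attribute named in the L question: lighting, language, accent, device, template, angle.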

MODAL is a lens, not a template. It will not appear as a forced subsection in every chapter. It is the question set a mature multimodal system can answer for any input it processes.
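One way to check whether a system can answer the question set for a given input is to write the answers down and look for gaps. The sketch below is an assumption-laden illustration, not a template from the book: the `ModalReview` class, its field names, and the gap messages are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ModalReview:
    """Answers to the five MODAL questions for one feature (hypothetical)."""
    modality: str = ""
    object_of_truth: str = ""
    derived_artifacts: list[str] = field(default_factory=list)
    # Maps each derived artifact to how it points back at the source,
    # for example "page + pixel box" or "audio offset".
    alignment: dict[str, str] = field(default_factory=dict)
    limits: list[str] = field(default_factory=list)

    def open_questions(self) -> list[str]:
        """List the MODAL questions this review has not answered yet."""
        gaps = []
        if not self.modality:
            gaps.append("M: modality not named")
        if not self.object_of_truth:
            gaps.append("O: object of truth not defined")
        if not self.derived_artifacts:
            gaps.append("D: no derived artifacts listed")
        unaligned = [a for a in self.derived_artifacts if a not in self.alignment]
        if unaligned:
            gaps.append("A: no source locator for " + ", ".join(unaligned))
        if not self.limits:
            gaps.append("L: no measured error types listed")
        return gaps
```

A review that lists `table_json` as a derived artifact but gives it no source locator would report an open A question, which is exactly the case where a value cannot be cited or audited.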

Table of contents

Movement I: The Demo Is Not the System

  1. The Demo Is Not the System
  2. Every Modality Brings Its Own Blind Spots
  3. A Vision Answer Is Not Evidence

Movement II: How Machines Learn Cross-Modal Meaning

  1. Encoders, Contrastive Learning, and Cross-Modal Alignment
  2. Joint Embedding Spaces and Their Limits

Movement III: Documents Are Multimodal Objects

  1. Documents Are Multimodal Objects
  2. OCR Is Not Document Understanding
  3. Grounding, Bounding Boxes, and Visual Citations

Movement IV: Audio and Video Are Time Systems

  1. Audio Is Not Text After Transcription
  2. Video Is Not Just Frames

Movement V: Multimodal RAG and Search

  1. Multimodal RAG and Cross-Modal Search

Movement VI: Evaluation: Seeing Is Not Verifying

  1. Evaluation: Seeing Is Not Verifying

Movement VII: Production Architecture

  1. Production Architecture for Multimodal Systems
  2. Cost, Latency, and the Price of Vision Tokens
  3. Privacy, Safety, and High-Stakes Images

Movement VIII: Use Case Field Guide

  1. The Use Case Field Guide

Back matter

  • Glossary
  • Implementation Checklist
  • Research and Source Register