Name: Multimodal in Practice

Read this alongside the Multimodal book, the AI-Native thesis, and the full book library when you want the surrounding argument.

Glossary

Alignment (MODAL "A"): The tie between a derived artifact and its source location: page + bounding box, audio offset, video frame, file hash. Without alignment there is no citation and no audit.

Bounding box (bbox): A rectangle locating a claim within an image or document page, normalized as (x, y, w, h). Pipeline-derived boxes (from OCR geometry) support verification; model-emitted boxes are hints of the same epistemic status as the claim.

Blind spot: The characteristic, modality-specific failure each modality introduces: small detail (image), the inverting word (audio), events between samples (video), lost structure (document), the exact value (chart).

Box origin: Whether a bounding box came from an independent pipeline (trustworthy for verification) or from the model itself (a hint only).
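
The two entries above (bounding box and box origin) can be held in one small record. What follows is a minimal sketch, not the book's schema: the names `BBox`, `BoxOrigin`, `to_pixels`, and `usable_for_verification` are illustrative assumptions, as is the choice to normalize coordinates to the range [0, 1].

```python
from dataclasses import dataclass
from enum import Enum


class BoxOrigin(Enum):
    PIPELINE = "pipeline"  # derived from independent OCR/layout geometry
    MODEL = "model"        # emitted by the same model that made the claim


@dataclass(frozen=True)
class BBox:
    # Assumed convention: (x, y, w, h) as fractions of page width/height.
    x: float
    y: float
    w: float
    h: float
    origin: BoxOrigin

    def __post_init__(self) -> None:
        for name in ("x", "y", "w", "h"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name}={value} is not normalized to [0, 1]")
        # Small tolerance so float rounding does not reject edge-aligned boxes.
        if self.x + self.w > 1.0 + 1e-9 or self.y + self.h > 1.0 + 1e-9:
            raise ValueError("box extends past the page edge")

    def to_pixels(self, page_width: int, page_height: int) -> tuple[int, int, int, int]:
        """Convert to integer pixel coordinates for cropping or overlay."""
        return (
            round(self.x * page_width),
            round(self.y * page_height),
            round(self.w * page_width),
            round(self.h * page_height),
        )

    @property
    def usable_for_verification(self) -> bool:
        # Only pipeline-derived boxes count as independent evidence;
        # a model-emitted box is a hint with the same status as the claim.
        return self.origin is BoxOrigin.PIPELINE
```

Carrying the origin on the box itself means a downstream verifier cannot mistake a model-emitted hint for independent geometry.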

Causality (temporal): The before/after order of events in video. Reasoning over an unordered bag of frames can reverse it; order must be preserved and presented.

Cross-modal retrieval: Retrieving items of one modality with a query of another (text→image, image→product, audio→moment) via a shared embedding space. A candidate-finder, not a decision-maker.
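
To make the "candidate-finder, not a decision-maker" distinction concrete, here is a minimal sketch. The three-dimensional vectors stand in for a real shared embedding space, and the item ids, attribute names, and the functions `find_candidates` and `decide` are all hypothetical.

```python
import math
from typing import Optional


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def find_candidates(query: list[float], index: dict[str, list[float]],
                    k: int = 2) -> list[tuple[str, float]]:
    """Similarity search: proposes candidates, ranked by cosine score."""
    scored = [(item_id, cosine(query, vec)) for item_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


def decide(candidates: list[tuple[str, float]],
           attributes: dict[str, dict[str, object]],
           required: dict[str, object]) -> Optional[str]:
    """The decision is made on structured attributes, not on the score."""
    for item_id, _score in candidates:
        attrs = attributes.get(item_id, {})
        if all(attrs.get(key) == value for key, value in required.items()):
            return item_id
    return None  # no candidate qualifies: abstain rather than guess


# Toy data (made up for illustration).
index = {
    "shoe_red": [0.9, 0.1, 0.0],
    "shoe_blue": [0.8, 0.2, 0.1],
    "hat_red": [0.1, 0.9, 0.0],
}
attributes = {
    "shoe_red": {"in_stock": False},
    "shoe_blue": {"in_stock": True},
    "hat_red": {"in_stock": True},
}
candidates = find_candidates([1.0, 0.0, 0.0], index, k=2)
chosen = decide(candidates, attributes, {"in_stock": True})
```

In this toy run the most similar item is not the one chosen, because the structured attribute check rejects it.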

Derived artifact: Any representation produced from raw media: OCR text, layout, caption, embedding, thumbnail, crop, transcript, table JSON, frame index. The durable, version-stamped output a production multimodal system actually accumulates.

Diarization: Labeling who spoke each segment of audio. Discarded by a flat transcript; needed wherever authority or accountability matters.

Disposition: The verification status of a claim: verified, unverified, conflict, or out-of-envelope. Automatic action is a privilege earned by reaching verified, not a default.

Grounding: Attaching a source region to every claim so it can be independently re-inspected. The mechanism that makes verification, human review, and audit possible.

Input-quality gate: A cheap pre-model check (resolution, blur, exposure, modality) that detects out-of-envelope inputs and reroutes them to re-capture, a human, or a different pipeline before spending a model call.
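
A gate of this kind can be very small. The sketch below uses only the standard library and takes a grayscale image as rows of 0-255 values; the thresholds, the return strings, and the function names are assumptions for illustration and would need tuning on real inputs. The blur heuristic used here is the variance of a 4-neighbor Laplacian, one common choice among several.

```python
from statistics import mean, pvariance

Gray = list[list[int]]  # rows of 0-255 luminance values


def laplacian_variance(img: Gray) -> float:
    """Variance of a 4-neighbor Laplacian; low values suggest blur."""
    h, w = len(img), len(img[0])
    responses = [
        img[r - 1][c] + img[r + 1][c] + img[r][c - 1] + img[r][c + 1] - 4 * img[r][c]
        for r in range(1, h - 1)
        for c in range(1, w - 1)
    ]
    return pvariance(responses)


def quality_gate(img: Gray, min_side: int = 8, min_sharpness: float = 50.0,
                 exposure_range: tuple[float, float] = (40.0, 215.0)) -> str:
    """Cheap checks run before any model call. min_side must be at least 3."""
    h, w = len(img), len(img[0])
    if min(h, w) < min_side:
        return "recapture: resolution too low"
    brightness = mean(v for row in img for v in row)
    if not exposure_range[0] <= brightness <= exposure_range[1]:
        return "recapture: under- or over-exposed"
    if laplacian_variance(img) < min_sharpness:
        return "recapture: blurry"
    return "proceed"


# Synthetic inputs: a sharp checkerboard, a featureless gray frame, a dark frame.
sharp = [[255 if (r + c) % 2 else 0 for c in range(8)] for r in range(8)]
flat = [[128] * 8 for _ in range(8)]
dark = [[10] * 8 for _ in range(8)]
```

The checks are ordered cheapest first, so a tiny or badly exposed input is rejected before the slightly more expensive sharpness pass.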

MODAL: The book's framework: Modality, Object of truth, Derived artifacts, Alignment, Limits. A lens for any multimodal input, not a chapter template.

Object of truth (MODAL "O"): What exactly must be answered, extracted, located, or acted on (a field value, a box, a timestamp, a decision), as opposed to a description. Checkable by construction.

Provenance: The recorded origin of a derived artifact: which source bytes (by hash), which region, which pipeline version, which model. A claim without provenance cannot be traced, verified, or deleted.
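
The four elements named above (source hash, region, pipeline version, model) map naturally onto a small immutable record. This is a sketch under assumptions: the field names, the region string format, and the helper `record_provenance` are illustrative, and SHA-256 is one reasonable hash choice rather than something the text prescribes.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Provenance:
    source_sha256: str     # hash of the immutable original bytes
    region: str            # assumed format, e.g. "page=3;bbox=0.10,0.20,0.30,0.05"
    pipeline_version: str
    model_id: str


def record_provenance(source_bytes: bytes, region: str,
                      pipeline_version: str, model_id: str) -> Provenance:
    """Stamp a derived artifact with where it came from."""
    digest = hashlib.sha256(source_bytes).hexdigest()
    return Provenance(digest, region, pipeline_version, model_id)


def artifacts_from_source(records: list[Provenance], source_bytes: bytes) -> list[Provenance]:
    """Find every artifact derived from a given original, e.g. for deletion."""
    digest = hashlib.sha256(source_bytes).hexdigest()
    return [rec for rec in records if rec.source_sha256 == digest]
```

Because the key is a content hash rather than a filename, the same lookup serves tracing, re-verification, and deletion.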

Reprocessing: The standing operation of re-deriving artifacts when a model or pipeline improves, gated by slice-aware eval and comparing new to old before superseding, so an upgrade does not silently create an inconsistent corpus.

Sampling (video): Choosing which frames to process. A deletion decision: any event shorter than the sampling interval can fall entirely between samples and become invisible. Reconciled by shot-, motion-, and audio-guided sampling.
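
The deletion effect is easy to reproduce with timestamps alone. In this sketch, uniform sampling at a 2 s interval misses a 1 s event, and adding a single cue-triggered sample recovers it. The audio cue timestamp is a hypothetical stand-in for whatever a real audio or motion detector would emit.

```python
def sample_times(duration_s: float, interval_s: float) -> list[float]:
    """Timestamps of uniformly sampled frames, starting at t=0."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    count = int(duration_s // interval_s) + 1
    return [i * interval_s for i in range(count)]


def event_is_seen(start_s: float, end_s: float, samples: list[float]) -> bool:
    """True if at least one sampled frame falls inside the event."""
    return any(start_s <= t <= end_s for t in samples)


uniform = sample_times(10.0, 2.0)      # frames at 0, 2, 4, 6, 8, 10 s
short_event = (2.5, 3.5)               # a 1 s event between two samples
missed = not event_is_seen(*short_event, uniform)

audio_cues = [3.0]                     # hypothetical cue from an audio detector
guided = sorted(set(uniform) | set(audio_cues))
caught = event_is_seen(*short_event, guided)
```

The same two functions can serve as a stress test: generate deliberately short synthetic events and check what fraction a sampling policy sees.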

Slice evaluation: Reporting accuracy per condition (lighting, accent, template, device, language) rather than as an average, and gating launches on the worst meaningful slice, because the average hides the slice where a new customer gets hurt.
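
A worst-slice gate takes only a few lines. In the sketch below, "meaningful" is approximated by a minimum sample count, which is an assumption; the slice names, threshold, and function names are likewise illustrative.

```python
from collections import Counter


def accuracy_by_slice(results: list[tuple[str, bool]]) -> dict[str, float]:
    """results: (slice_name, is_correct) per evaluated item."""
    totals = Counter(name for name, _ in results)
    correct = Counter(name for name, ok in results if ok)
    return {name: correct[name] / totals[name] for name in totals}


def worst_slice_gate(results: list[tuple[str, bool]], threshold: float,
                     min_samples: int = 20) -> tuple[bool, str, float]:
    """Gate on the worst slice that has enough samples to be meaningful."""
    totals = Counter(name for name, _ in results)
    per_slice = accuracy_by_slice(results)
    meaningful = {n: a for n, a in per_slice.items() if totals[n] >= min_samples}
    if not meaningful:
        raise ValueError("no slice has enough samples to evaluate")
    worst = min(meaningful, key=meaningful.get)
    return meaningful[worst] >= threshold, worst, meaningful[worst]


# Made-up eval results: strong in daylight, weak in low light.
results = (
    [("daylight", True)] * 95 + [("daylight", False)] * 5
    + [("low_light", True)] * 12 + [("low_light", False)] * 8
)
average = sum(ok for _, ok in results) / len(results)
passed, worst, worst_accuracy = worst_slice_gate(results, threshold=0.85)
```

With this data the overall average clears the 0.85 threshold while the low-light slice does not, so a gate on the average would launch and a gate on the worst slice would block.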

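Special-category data: Health or biometric data (special categories under GDPR Art. 9), along with children's data, which GDPR protects separately (for example Art. 8); all require stricter handling and are tagged at ingest to drive retention, access, and external-model routing.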

Supersession: A document page (or a derived artifact) overriding an earlier one. Modeled explicitly so the system does not answer from a stale version.

Verified claim: The unit of output in a multimodal system: a value plus its grounding plus the outcomes of independent verifiers plus a disposition. Replaces returning a bare answer.
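
One way to represent this unit of output, together with the four dispositions defined earlier in the glossary, is sketched below. The rule that maps verifier outcomes to a disposition (all pass means verified, any failure means conflict, none run means unverified) is an assumption made for illustration, as are the class and field names.

```python
from dataclasses import dataclass, field
from enum import Enum


class Disposition(Enum):
    VERIFIED = "verified"
    UNVERIFIED = "unverified"
    CONFLICT = "conflict"
    OUT_OF_ENVELOPE = "out-of-envelope"


@dataclass
class VerifiedClaim:
    value: str
    grounding: str                      # source region the claim points at
    in_envelope: bool = True            # outcome of the input-quality gate
    verifier_results: dict[str, bool] = field(default_factory=dict)

    @property
    def disposition(self) -> Disposition:
        if not self.in_envelope:
            return Disposition.OUT_OF_ENVELOPE
        if not self.verifier_results:
            return Disposition.UNVERIFIED   # nothing independent has checked it
        if all(self.verifier_results.values()):
            return Disposition.VERIFIED
        return Disposition.CONFLICT         # at least one verifier disagrees

    @property
    def may_auto_act(self) -> bool:
        # Automatic action is earned by reaching VERIFIED, never the default.
        return self.disposition is Disposition.VERIFIED


claim = VerifiedClaim(value="1,240.00", grounding="page=2;bbox=0.61,0.80,0.12,0.03")
claim.verifier_results["ocr_matches_region"] = True
claim.verifier_results["line_items_sum_to_total"] = False
```

Note that a freshly created claim starts as unverified, so code that forgets to run verifiers cannot trigger automatic action by accident.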

Vision tokens: The bounded set of tokens an image becomes after projection; billed like text tokens, scaling with resolution and tiling. The dominant and most surprising cost in many multimodal systems.

Word error rate (WER): The standard transcription metric. Averages over all words and is therefore blind to errors on the decision-bearing minority (negations, numbers, names) that actually flip outcomes.
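
The blindness is visible in a worked example. The sketch below computes WER as word-level edit distance divided by reference length, then scores two hypotheses that each contain exactly one word error. The sentences are invented for illustration.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                          # deletion
                curr[j - 1] + 1,                      # insertion
                prev[j - 1] + (ref_word != hyp_word), # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


reference = "the patient does not have a penicillin allergy"
harmless = "the patient does not have any penicillin allergy"   # "a" -> "any"
inverting = "the patient does have a penicillin allergy"        # "not" dropped

wer_harmless = wer(reference, harmless)
wer_inverting = wer(reference, inverting)
```

Both hypotheses score the same WER, yet only one of them reverses the meaning. That is why decision-bearing tokens need their own check rather than a share of the average.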

Implementation Checklist

A team's multimodal system is approaching production-ready when it can answer yes, with evidence, to each of these. Items are grouped by movement.

Perception and the demo trap (Movements I-II)

Every input modality the product accepts has a written blind-spot review: modality named honestly, object of truth defined as something checkable, blind spot located, the metric that would hide it identified, and the external antidote + sliced measurement specified.
An input-quality gate runs before the model: resolution, blur, and exposure heuristics, plus modality detection that reroutes screenshots-of-documents to the document pipeline and offers re-capture for bad inputs.
Failures are tagged with a fixed taxonomy (perception / OCR / grounding / temporal / reasoning / policy), and the distribution drives where the team invests.
The team can state, for each feature, why a similarity score is being used as a candidate-finder and not as a verdict, and what structured attributes make the actual decision.

Documents, audio, video (Movements III-IV)

Documents are represented with a structure-preserving schema (text spans with boxes, table cells with merged-cell spans, key-value pairs, visual-mark state, page supersession, file hash), not flattened to text.
Extraction produces typed, labeled, grounded fields located by label and position, verified at the understanding level (consistency, attribution, supersession), and reported as field/task accuracy, never OCR accuracy, to anyone making an automation decision.
Handwriting is flagged as lower-confidence in the schema and high-stakes handwritten fields default to human review.
Audio is represented as diarized, timestamped segments with token-level confidence; decision-bearing tokens (negations, numbers, names) are flagged and verified, and the audio (not just the transcript) is retained.
Video is ingested as a time-ordered, multi-track object with audio/shot-guided sampling stress-tested against deliberately short events, on-screen text OCR'd, and causality preserved.

Retrieval, evaluation, operations (Movements V-VII)

Multimodal retrieval routes by object of truth, preserves region-level provenance through retrieval, filters on permission/recency/quality/modality before ranking, and reranks heterogeneous candidates with a content-aware model; the answer cites chunk IDs and each material claim is re-verified against its region.
Perception and reasoning are evaluated separately; the perception-vs-conclusion confusion matrix is used to surface "right for wrong reasons" cases; launches are gated on the worst meaningful slice, not the average.
LLM-as-judge is used only for reasoning-given-gold-perception, format/helpfulness, and triage, never as perception ground truth.
Originals are immutable; every derived artifact links to its source by hash and records pipeline version and model id; inconsistency across versions is queryable; reprocessing is gated by slice-aware eval with compare-before-supersede.
Per-step processing events feed dashboards for input-quality, conflict rate, human-review rate, latency, and per-document vision-token cost; drift is alertable.
Human review is sized from the measured review rate, made efficient by grounding-based visual citations, and feeds corrections back into the eval set.
A pre-written runbook exists for "wrong extraction acted on," covering reproduce, localize, blast-radius-across-slice, contain, fix-and-reprocess, regress.
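
For the perception-vs-conclusion confusion matrix mentioned in this group, a minimal sketch follows. It assumes each eval item has been labeled twice, once for whether perception was correct and once for whether the final conclusion was correct. Apart from "right for wrong reasons", the cell names are illustrative.

```python
from collections import Counter

CELL_NAMES = {
    (True, True): "right for right reasons",
    (False, True): "right for wrong reasons",   # correct answer, wrong perception
    (True, False): "reasoning failure",         # saw correctly, concluded wrongly
    (False, False): "perception failure",
}


def perception_conclusion_matrix(cases: list[tuple[bool, bool]]) -> Counter:
    """cases: (perception_correct, conclusion_correct) per eval item."""
    return Counter(CELL_NAMES[(bool(p), bool(c))] for p, c in cases)


# Made-up labels for eight eval items.
cases = [
    (True, True), (True, True), (True, True), (True, True),
    (False, True), (False, True),
    (True, False),
    (False, False),
]
matrix = perception_conclusion_matrix(cases)
end_to_end_accuracy = sum(c for _, c in cases) / len(cases)
```

In this toy set, end-to-end accuracy counts six of eight as correct, but two of those six rest on wrong perception and would not survive a change in inputs.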

Cost, privacy, safety (Movements VII-VIII)

A per-modality cost model is run on real (often high-resolution) inputs; cost is budgeted as fully-loaded cost per verified successful task, including human-review cost.
The system crops to the region of interest (serving accuracy, cost, and privacy at once) and caches derived artifacts.
Metadata is stripped, sensitive regions redacted, and sensitivity classified at ingest; special-category media cannot reach an external model without consent (enforced by a policy gate).
Deletion fans out to every derived artifact including embeddings in the vector index and provider-side copies, producing an auditable tombstone.
High-stakes image use is decision support with calibrated abstention and human authority, with thresholds set by consequence-weighted error and tuned toward over-escalation.
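
The "fully-loaded cost per verified successful task" in this group reduces to simple arithmetic once the rates are measured. The sketch below assumes a simple model in which every task incurs model cost, a measured fraction also incurs human-review cost, and only verified successes count in the denominator. All figures and parameter names are made up for illustration.

```python
def cost_per_verified_task(model_cost_per_task: float, review_rate: float,
                           review_cost_per_task: float,
                           verified_success_rate: float) -> float:
    """Fully-loaded cost divided by the share of tasks that end verified."""
    if not 0.0 < verified_success_rate <= 1.0:
        raise ValueError("verified_success_rate must be in (0, 1]")
    if not 0.0 <= review_rate <= 1.0:
        raise ValueError("review_rate must be in [0, 1]")
    loaded_cost = model_cost_per_task + review_rate * review_cost_per_task
    return loaded_cost / verified_success_rate


# Hypothetical figures: 0.02 per model call, 10% of tasks reviewed at 1.50 each,
# and 80% of tasks ending as verified successes.
naive = 0.02
loaded = cost_per_verified_task(
    model_cost_per_task=0.02,
    review_rate=0.10,
    review_cost_per_task=1.50,
    verified_success_rate=0.80,
)
```

With these made-up numbers the fully-loaded figure is roughly ten times the per-call model cost, with human review as the dominant term.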

Research and Source Register

Sources grouped by chapter. A source appears under a chapter only if that chapter actually uses it to support a claim.

Front matter / Introduction: synthetic; draws on the book's own argument. No external citations.

Ch. 1, The Demo Is Not the System

Ch. 2, Every Modality Brings Its Own Blind Spots

Ch. 3, A Vision Answer Is Not Evidence

Ch. 4, Encoders, Contrastive Learning, and Cross-Modal Alignment

Ch. 5, Joint Embedding Spaces and Their Limits

Ch. 6, Documents Are Multimodal Objects

Ch. 7, OCR Is Not Document Understanding

Ch. 8, Grounding, Bounding Boxes, and Visual Citations

Ch. 9, Audio Is Not Text After Transcription

Ch. 10, Video Is Not Just Frames

Ch. 11, Multimodal RAG and Cross-Modal Search

Ch. 12, Evaluation: Seeing Is Not Verifying

Ch. 13, Production Architecture for Multimodal Systems

Ch. 14, Cost, Latency, and the Price of Vision Tokens

Ch. 15, Privacy, Safety, and High-Stakes Images

Ch. 16, The Use Case Field Guide

Appendix A: Back Matter