Name: Multimodal in Practice

> **Working claim:** Multimodal inputs are priced and billed differently from text in ways that surprise teams on the first invoice and again at scale. An image is not a flat add-on; its cost scales with resolution and tiling. Audio and video are billed per minute and per sampled frame. The biggest lever is not a cheaper model but sending fewer, more-relevant pixels, which usually improves accuracy at the same time.

The invoice that was 30x the estimate

A team built a support-screenshot assistant and estimated cost from text-prompt pricing: a few cents per interaction. The first month's bill was roughly thirty times the estimate. The cause was not a bug; it was a misunderstanding of how images are billed. Each screenshot was high-resolution, and the model tiled high-resolution images into many patches, each patch costing tokens, so a single full-resolution screenshot consumed thousands of vision tokens, more than the entire text of the conversation around it. The team had modeled the image as a small fixed surcharge. It was the dominant cost, and it scaled with the resolution of whatever the user happened to upload.

This is the first thing to internalize: vision tokens are real tokens, billed like text tokens, and their count is a function of image resolution and the model's tiling scheme. As Chapter 4 explained, a vision-language model turns an image into a bounded set of visual tokens via a projection (LLaVA, Flamingo); higher-resolution inputs are commonly handled by tiling the image into multiple crops, each encoded separately, so the token count grows with resolution rather than staying flat. A small thumbnail might cost a few hundred tokens; a full-resolution multi-megapixel screenshot might cost several thousand. The cost model that works treats every image as a token quantity to be estimated and budgeted, not as a flat per-image fee.

Figure: Whiteboard-style sketch for Cost, Latency, and the Price of Vision Tokens. Vision cost and accuracy improve when the system localizes cheaply first, then reads only the smallest high-resolution crop that matters.

A cost model you can actually use

The fix is to build an explicit cost model per modality and run real inputs through it before launch, rather than extrapolating from text pricing. The model has to capture the things that actually drive cost: image resolution and tiling, audio minutes, video frames sampled, and the text tokens around them.

```python
from dataclasses import dataclass


@dataclass
class Pricing:
    text_token_usd: float    # per input token
    output_token_usd: float  # per output token
    # Vision: tokens depend on resolution/tiling; estimate tokens, then price.
    tile_tokens: int         # tokens per tile
    base_image_tokens: int   # tokens for the low-res overview pass
    audio_minute_usd: float  # transcription per minute


def image_tokens(width: int, height: int, p: Pricing,
                 tile_px: int = 512) -> int:
    """Tokens grow with resolution because high-res images are tiled."""
    tiles_x = -(-width // tile_px)   # ceil division
    tiles_y = -(-height // tile_px)
    return p.base_image_tokens + tiles_x * tiles_y * p.tile_tokens


def cost_screenshot_turn(img_w: int, img_h: int, prompt_tokens: int,
                         output_tokens: int, p: Pricing) -> dict:
    """Cost of one turn: image plus prompt billed as input, reply as output."""
    vt = image_tokens(img_w, img_h, p)
    input_usd = (vt + prompt_tokens) * p.text_token_usd
    output_usd = output_tokens * p.output_token_usd
    return {
        "vision_tokens": vt,
        "input_usd": input_usd,
        "output_usd": output_usd,
        "total_usd": input_usd + output_usd,
    }


# Placeholder constants for illustration only; take real values from your
# provider's current pricing documentation.
PR = Pricing(text_token_usd=3e-6, output_token_usd=15e-6,
             tile_tokens=170, base_image_tokens=85, audio_minute_usd=0.006)

# The 30x surprise, made visible:
full_res = cost_screenshot_turn(2400, 1600, prompt_tokens=300, output_tokens=200, p=PR)
cropped = cost_screenshot_turn(700, 400, prompt_tokens=300, output_tokens=200, p=PR)
# full_res["vision_tokens"] >> cropped["vision_tokens"]: the lever is resolution.
```

Run this over a sample of real inputs (which are often higher-resolution than the demo's), and the monthly bill stops being a surprise. The exact token-per-tile numbers vary by provider and model and must be taken from current pricing documentation; the structure, not the constants, is the durable part. The point the code makes concrete is in its final comparison: the same screenshot at full resolution versus cropped to the relevant region differs by a large multiple of vision tokens, which is the lever the next section is about.
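One way to act on the advice to run real inputs through the model is to project monthly vision spend from a sample of logged image dimensions. The sketch below is illustrative only: the tile size, tokens per tile, base tokens, token price, sample dimensions, and monthly volume are all assumed placeholders, and the helper names are hypothetical.

```python
import math
import statistics

# Assumed placeholder constants; replace with your provider's documented values.
TILE_PX = 512
TILE_TOKENS = 170
BASE_IMAGE_TOKENS = 85
INPUT_TOKEN_USD = 3e-6


def estimate_image_tokens(width: int, height: int) -> int:
    """Overview pass plus one block of tokens per tile."""
    tiles = math.ceil(width / TILE_PX) * math.ceil(height / TILE_PX)
    return BASE_IMAGE_TOKENS + tiles * TILE_TOKENS


def project_monthly_vision_usd(sample_sizes, images_per_month: int) -> dict:
    """Project monthly vision spend from sampled (width, height) pairs."""
    tokens = [estimate_image_tokens(w, h) for w, h in sample_sizes]
    mean_tokens = statistics.mean(tokens)
    return {
        "mean_tokens": mean_tokens,
        "max_tokens": max(tokens),
        "monthly_usd": mean_tokens * images_per_month * INPUT_TOKEN_USD,
    }


# Hypothetical sample: dimensions logged from real uploads, not demo images.
sample = [(2400, 1600), (1920, 1080), (3840, 2160), (700, 400)]
projection = project_monthly_vision_usd(sample, images_per_month=100_000)
print(projection)
```

Reporting the maximum alongside the mean is a deliberate choice: a few very large uploads can dominate the bill even when the typical image is small.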

The biggest lever: send fewer, more-relevant pixels

Here is the happy alignment that makes multimodal cost optimization unusually pleasant: the cheapest input is often the most accurate input. Recall the resolution blind spot from Chapters 1 and 6: a model reading a full-frame screenshot may miss small text, and the fix is to crop the region of interest and send it at high resolution. That same crop is cheaper than the full frame, because it has fewer pixels and thus fewer tiles and tokens. Cropping reduces cost and improves accuracy simultaneously, which is rare in engineering and worth exploiting hard.

The pattern: do a cheap localization pass first to find the region of interest (the error dialog in the screenshot, the total on the invoice, the damaged area in the photo) using a low-resolution overview or a cheap detector, then send only that region at high resolution for the actual read. This two-pass approach (overview to localize, crop to read) costs roughly one low-res image plus one small high-res crop, far less than one full-resolution image, and it reads the relevant detail better because the detail now occupies more of the (smaller) input. The GPT-4V system card's notes on small-text weakness and the tiling-based resolution handling in current models both point the same way: resolution should be spent where the answer is, not uniformly across a frame that is mostly irrelevant. A system that sends every input at full resolution is paying maximum cost for minimum accuracy on the detail that matters.
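The token arithmetic behind the two-pass pattern can be sketched as follows. This is a rough estimate under assumed tiling constants, not any provider's actual scheme; the crop box stands in for whatever the localization pass returns, and all names are hypothetical.

```python
import math

# Assumed placeholder constants; real values come from provider documentation.
TILE_PX = 512
TILE_TOKENS = 170
BASE_IMAGE_TOKENS = 85


def tiled_tokens(width: int, height: int) -> int:
    """Overview tokens plus one block of tokens per tile."""
    tiles = math.ceil(width / TILE_PX) * math.ceil(height / TILE_PX)
    return BASE_IMAGE_TOKENS + tiles * TILE_TOKENS


def downscale_to_fit(width: int, height: int, max_side: int):
    """Shrink so the longer side is at most max_side, keeping aspect ratio."""
    scale = min(1.0, max_side / max(width, height))
    return max(1, round(width * scale)), max(1, round(height * scale))


def two_pass_tokens(full_size, crop_box, overview_max_side: int = 512) -> int:
    """Tokens for a low-res overview plus a native-resolution crop.

    crop_box is (left, top, right, bottom) in full-image pixels, as a
    localization pass might return it (hypothetical here).
    """
    ow, oh = downscale_to_fit(*full_size, overview_max_side)
    left, top, right, bottom = crop_box
    return tiled_tokens(ow, oh) + tiled_tokens(right - left, bottom - top)


full_size = (2400, 1600)
error_dialog_box = (900, 600, 1500, 1000)  # hypothetical localized region

single_pass = tiled_tokens(*full_size)
two_pass = two_pass_tokens(full_size, error_dialog_box)
print(f"single pass: {single_pass} tokens, two pass: {two_pass} tokens")
```

The sketch ignores the cost of the localization step itself when a separate detector is used; if the overview is sent to the same model, its tokens are already counted.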

Audio and video: per-minute and per-frame economics

Audio and video have their own cost shapes. Transcription is typically priced per minute of audio, so cost scales with recording length, and the lever is to avoid transcribing what you do not need: use voice-activity detection to skip silence and, for long recordings, transcribe on demand around a query rather than transcribing everything up front. Video is the most expensive modality because it compounds: every sampled frame is an image with image-token cost, so a video's cost is roughly (frames sampled) × (per-image tokens) × (price per token) + (audio minutes) × (price per minute). This makes the Chapter 10 sampling decision a cost decision as well as an accuracy decision: sampling one frame per second of a ten-minute video means 600 image encodings, and naive dense sampling can be ruinous. The reconciliation is the audio-guided sampling from Chapter 10: sample densely only where the audio or shot detection says something is happening, and sparsely elsewhere, which controls cost and still catches the events that matter. Long-context models (Gemini 1.5) shift this calculus by allowing dense ingestion in fewer calls, but the tokens are still tokens and the bill still scales with how much media you feed in. Large context is a capability, not a discount.
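The video arithmetic above can be made concrete with a short estimator comparing dense sampling against guided sampling. This is a sketch under stated assumptions: the per-frame token count, both prices, the sparse sampling rate, and the active segments are illustrative placeholders, and the function names are hypothetical rather than taken from Chapter 10.

```python
from dataclasses import dataclass


@dataclass
class VideoPricing:
    # All values are assumed placeholders, not any provider's real prices.
    tokens_per_frame: int = 255
    input_token_usd: float = 3e-6
    audio_minute_usd: float = 0.006


def video_cost_usd(frames_sampled: int, audio_minutes: float,
                   p: VideoPricing) -> float:
    """Frames billed as images plus audio billed per minute."""
    frame_usd = frames_sampled * p.tokens_per_frame * p.input_token_usd
    return frame_usd + audio_minutes * p.audio_minute_usd


def guided_frame_count(duration_s: float, active_segments,
                       dense_fps: float = 1.0, sparse_fps: float = 0.1) -> int:
    """Frames when sampling densely in active segments, sparsely elsewhere.

    active_segments is a list of non-overlapping (start_s, end_s) spans, as
    might come from audio or shot detection (hypothetical input here).
    """
    active_s = sum(end - start for start, end in active_segments)
    idle_s = duration_s - active_s
    return round(active_s * dense_fps + idle_s * sparse_fps)


p = VideoPricing()
duration_s = 600                      # ten-minute video
dense_frames = int(duration_s * 1.0)  # 1 fps everywhere
guided_frames = guided_frame_count(duration_s, [(30, 60), (400, 430)])

print(dense_frames, video_cost_usd(dense_frames, 10, p))
print(guided_frames, video_cost_usd(guided_frames, 10, p))
```

With these assumed numbers the audio term is identical in both cases, so the entire difference comes from the frame count.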

| Modality | Cost driver | Cheap lever | Accuracy interaction |
| --- | --- | --- | --- |
| Image | Resolution × tiling | Crop to region of interest | Cropping also improves small-detail accuracy |
| Document | Pages × per-page image tokens | Layout OCR for text; crop only marks/regions to VLM | Structure is preserved more cheaply than with a whole-page VLM read |
| Audio | Minutes transcribed | VAD to skip silence; on-demand transcription | No accuracy cost to skipping silence |
| Video | Frames × image tokens + audio minutes | Audio/shot-guided sampling | Smart sampling also catches short events |
| Chart | Image tokens to read it | Extract from source data when available | Source data is exact and free of vision tokens |

The table's through-line is that the cost lever and the accuracy lever are usually the same lever pulled in the same direction: spend pixels and minutes where the answer is, and you pay less and get more right. The chart row is the strongest case: when the underlying data exists, reading it costs zero vision tokens and is exactly correct, dominating any image-based read on both axes.

Latency is a separate budget with its own levers

Cost and latency are correlated but not identical, and conflating them leads to wrong optimizations. A high-resolution image costs more tokens and takes longer to process; cropping helps both. But some latency levers do not help cost, and vice versa. Caching derived artifacts (Chapter 13) helps both latency and cost on repeat access: if a document was already OCR'd, do not OCR it again. It is the highest-leverage move for any system that processes the same media more than once. Parallelism across modalities helps latency but not cost: transcribing audio while sampling frames in parallel shortens wall-clock time without changing the bill. Model selection trades both: a smaller model is cheaper and faster but may push more cases to human review, so the true cost includes the review cost, not just the API cost. The honest cost model therefore uses the fully-loaded cost per successful task: API cost plus the amortized human-review cost for the fraction that does not verify (Chapter 3). A cheaper model that doubles the review rate can be more expensive overall, and only the fully-loaded number reveals it.
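A minimal sketch of the caching idea, assuming that identical bytes always yield the same derived artifact: key the cache on a content hash plus the processing step, so repeat access skips the expensive call. The class and the `fake_ocr` stand-in are hypothetical, not the design from Chapter 13.

```python
import hashlib
from typing import Callable, Dict


class DerivedArtifactCache:
    """In-memory cache keyed by processing step plus content hash.

    A sketch only: a production system would likely use persistent storage
    and include the processor version in the key.
    """

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}
        self.misses = 0

    def get_or_compute(self, media: bytes, step: str,
                       compute: Callable[[bytes], str]) -> str:
        key = f"{step}:{hashlib.sha256(media).hexdigest()}"
        if key not in self._store:
            self.misses += 1  # only misses pay for the expensive step
            self._store[key] = compute(media)
        return self._store[key]


def fake_ocr(media: bytes) -> str:
    """Hypothetical stand-in for an expensive OCR or transcription call."""
    return f"text extracted from {len(media)} bytes"


cache = DerivedArtifactCache()
document = b"%PDF-1.7 example bytes"
first = cache.get_or_compute(document, "ocr", fake_ocr)
second = cache.get_or_compute(document, "ocr", fake_ocr)  # served from cache
print(cache.misses)
```

Hashing the content rather than the filename means a re-uploaded or renamed copy of the same file still hits the cache.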

Budgeting per successful task, not per call

The closing discipline ties cost back to the book's central frame. The unit that matters is not cost per API call but cost per verified successful task, because an answer that goes to human review or comes back wrong has not done the job, and its API cost is partly wasted. A system with a low per-call cost but a high conflict rate is paying twice (once for the model call, once for the human review), and its true cost per completed invoice is higher than that of a system with a more expensive call that verifies cleanly. This reframes model and resolution choices as a single optimization: minimize the fully-loaded cost per verified task. That means spending enough on resolution and model quality to keep the verification rate high (avoiding review costs) while cropping and sampling to avoid spending tokens where the answer is not. The GPT-4o system card, with its emphasis on a model optimized for multimodal efficiency, is a reminder that providers are pushing the cost per capability down. But the system-level levers (pixels and minutes where the answer is, cached derivatives, and a verification rate high enough to keep humans out of the cheap cases) remain in the builder's hands and dominate the bill more than the choice of model.
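The fully-loaded comparison can be written as a few lines of arithmetic. The sketch assumes every task is eventually completed, either by automatic verification or by a human reviewer at a fixed cost; the function name and all dollar figures and rates are hypothetical, chosen only to illustrate how a cheaper call can lose.

```python
def fully_loaded_cost_per_verified_task(api_usd_per_call: float,
                                        review_rate: float,
                                        review_usd: float) -> float:
    """API cost plus the amortized human-review cost per completed task.

    Assumes every task is eventually completed: either it verifies
    automatically or a human reviewer finishes it at review_usd.
    """
    if not 0.0 <= review_rate <= 1.0:
        raise ValueError("review_rate must be between 0 and 1")
    return api_usd_per_call + review_rate * review_usd


# Hypothetical numbers chosen only to illustrate the comparison.
REVIEW_USD = 0.50
larger_model = fully_loaded_cost_per_verified_task(0.020, 0.05, REVIEW_USD)
smaller_model = fully_loaded_cost_per_verified_task(0.005, 0.10, REVIEW_USD)

print(f"larger model:  ${larger_model:.3f} per task")
print(f"smaller model: ${smaller_model:.3f} per task")
```

In this example the smaller model's call is a quarter of the price, yet doubling the review rate makes it the more expensive option per completed task.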

Chapter summary

Multimodal inputs are billed in ways that surprise teams: vision tokens are real tokens whose count scales with image resolution and the model's tiling scheme (LLaVA/Flamingo-style projection), so a full-resolution screenshot can cost thousands of tokens and dominate the bill, as in the 30x-over-estimate incident. Build an explicit per-modality cost model (image tokens from resolution and tiling, audio per minute, video per sampled frame plus audio) and run real, often-higher-resolution inputs through it before launch. The biggest lever is sending fewer, more-relevant pixels: a cheap localization pass followed by a high-resolution crop of the region of interest reduces cost and improves small-detail accuracy at once, the rare case where the cost lever and the accuracy lever point the same way. Audio cost scales per minute (skip silence, transcribe on demand), and video compounds frames-as-images with audio minutes, making the sampling decision a cost decision reconciled by audio/shot-guided sampling; long-context models change the call structure but not the token-scaling bill. Latency is a separate budget with its own levers: caching derived artifacts (cost and latency), cross-modal parallelism (latency only), and model selection (both, but include review cost). The unit that matters is fully-loaded cost per verified successful task, because an answer that routes to human review or returns wrong has not done the job and is paid for twice; a cheaper model that doubles the review rate can cost more overall. The system-level levers (pixels and minutes where the answer is, cached derivatives, and a verification rate high enough to keep humans out of cheap cases) dominate the bill more than the choice of model. This chapter follows the operational model built in Production Architecture for Multimodal Systems; the next chapter extends those operating concerns into Privacy, Safety, and High-Stakes Images.

Cost, Latency, and the Price of Vision Tokens