Failure Modes and Recovery
Where the system breaks, what early signals matter, and how to recover.
What To Measure
Measurement has to start from consequence. For latency budgets, the headline metric is completed task rate by perceived wait. It should be segmented by task type, source quality, user intent, and risk level.
A single average hides the cases that matter. The system can look strong overall while failing the small set of requests that create support tickets, compliance exposure, or customer distrust.
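A minimal sketch of the segmentation described above, assuming a hypothetical `TaskEvent` record and illustrative wait buckets (neither comes from a specific product). It reports the completed task rate per task type and perceived-wait bucket instead of one overall average.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskEvent:
    task_type: str       # hypothetical segment label, e.g. "search" or "refund"
    wait_seconds: float  # wait as the user experienced it
    completed: bool      # whether the user finished the task


# Illustrative buckets (assumption): (exclusive upper bound in seconds, label).
WAIT_BUCKETS = [(1.0, "under_1s"), (5.0, "1s_to_5s"), (float("inf"), "over_5s")]


def wait_bucket(wait_seconds: float) -> str:
    """Map a wait time to the first bucket whose upper bound exceeds it."""
    for upper, label in WAIT_BUCKETS:
        if wait_seconds < upper:
            return label
    return WAIT_BUCKETS[-1][1]


def completion_rate_by_segment(events):
    """Return {(task_type, wait_bucket): completed / total} for each segment."""
    totals = defaultdict(lambda: [0, 0])  # key -> [completed, total]
    for event in events:
        key = (event.task_type, wait_bucket(event.wait_seconds))
        totals[key][0] += int(event.completed)
        totals[key][1] += 1
    return {key: done / total for key, (done, total) in totals.items()}
```

The same grouping key can be extended with source quality, user intent, or risk level if those fields are logged.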
Sampling
The sample should include routine work, edge cases, recent failures, and examples from the newest traffic. Static test sets are useful, but they become stale. Production changes faster than a spreadsheet of favorite examples.
Sampling also protects the team from politics. When examples are selected by habit or seniority, the evaluation becomes a negotiation. When examples come from a documented sampling rule, the conversation moves back to evidence.
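One way to make the sampling rule documented and repeatable is to write it down as data and draw from it with a fixed seed. The sketch below assumes a hypothetical pool of examples, each tagged with an `id` and a `stratum`; the quotas are illustrative, not recommended values.

```python
import random

# Hypothetical documented rule (assumption): examples to draw per stratum.
SAMPLING_RULE = {
    "routine": 3,
    "edge_case": 2,
    "recent_failure": 2,
    "new_traffic": 2,
}


def draw_review_sample(pool, rule=SAMPLING_RULE, seed=0):
    """Draw a reproducible stratified sample of (stratum, example id) pairs.

    `pool` is a list of dicts with "id" and "stratum" keys. Sorting the
    candidate ids first makes the draw independent of the pool's order.
    """
    rng = random.Random(seed)
    sample = []
    for stratum, quota in rule.items():
        candidates = sorted(ex["id"] for ex in pool if ex["stratum"] == stratum)
        chosen = rng.sample(candidates, min(quota, len(candidates)))
        sample.extend((stratum, ex_id) for ex_id in chosen)
    return sample
```

Because the rule and the seed are both recorded, anyone on the team can regenerate the same review set and check that no example was added by preference.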
Interpreting Scores
A score matters only when it changes an action. Decide in advance what score blocks a release, what score triggers review, and what score expands the rollout. Without thresholds, measurement becomes decoration.
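Thresholds decided in advance can be written as a small function so the mapping from score to action is not renegotiated after the results arrive. The numbers below are placeholders, and the `HOLD` action (keep the current rollout) is an assumption added to cover scores that fall between the review and expansion thresholds.

```python
from enum import Enum


class Action(Enum):
    BLOCK = "block_release"
    REVIEW = "trigger_review"
    HOLD = "hold_rollout"      # assumption: keep the current rollout unchanged
    EXPAND = "expand_rollout"


# Placeholder thresholds (assumptions); set these before running the evaluation.
BLOCK_BELOW = 0.80
REVIEW_BELOW = 0.90
EXPAND_AT_OR_ABOVE = 0.95


def decide(score: float) -> Action:
    """Map an evaluation score in [0, 1] to a release action."""
    if score < BLOCK_BELOW:
        return Action.BLOCK
    if score < REVIEW_BELOW:
        return Action.REVIEW
    if score >= EXPAND_AT_OR_ABOVE:
        return Action.EXPAND
    return Action.HOLD
```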
The team should inspect both good and bad examples. Good examples show what the system is learning to do reliably. Bad examples show whether the failure can be fixed by policy, data, design, or scope reduction.
The Review Loop
The review loop is where quality compounds. Record the failure, classify it, fix the appropriate layer, and rerun the affected cases. Keep the history visible so the team can see whether it is improving or merely changing.
Budget time by user expectation before choosing a model. That practice keeps the evaluation connected to the work instead of becoming a compliance ritual.
Research Lens
The research base for The Latency Budget matters because latency budgets sit between capability and consequence. Papers, benchmarks, and risk frameworks can show what is possible, but production teams still have to translate that evidence into decisions. This chapter treats research as a constraint on judgment, not as decoration.
The most useful research habit is to separate mechanism from outcome. A paper can show that a method improves a benchmark. It does not prove that the same method improves completed task rate by perceived wait in your product. That gap is where evaluation, sampling, and release discipline belong.
For this chapter, read external sources as pressure tests. If a source describes a known weakness, ask whether your system can observe that weakness. If a source describes a benchmark gain, ask whether your users send the same kind of work. If a source describes a risk, ask who owns it after launch.
Measurement method
Start with a written task statement. It should name the user, the input, the expected output, the source of truth, and the action that follows. If any of those pieces is missing, the latency budget is not ready for broad automation because the team cannot tell whether the result is good enough.
Next, define the control surface. For this topic, the control surface includes request classes, model routing, streaming, cache policy, and timeout behavior. Each control should have a reason to exist and a way to be tested. A control that cannot be tested becomes process theater. A control that can be tested becomes part of the operating system.
Finally, decide what the system does when the answer is not ready. The mature options are to ask for more context, return a partial answer with evidence, route to a person, or stop. The immature option is to keep generating until the output sounds confident.
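The timeout part of the control surface can be sketched with the standard library's `asyncio.wait_for`. The `RequestClass` type, its field names, and the `generate` callable are hypothetical; the point is only that each request class carries its own budget and a pre-decided fallback, so the system never keeps generating past the deadline.

```python
import asyncio
from dataclasses import dataclass

# The four fallbacks named in the text, as status strings (naming is an assumption).
FALLBACKS = {"ask_for_context", "partial_with_evidence", "route_to_person", "stop"}


@dataclass(frozen=True)
class RequestClass:
    name: str               # hypothetical label, e.g. "autocomplete"
    timeout_seconds: float  # latency budget for this class
    on_timeout: str         # one of FALLBACKS

    def __post_init__(self):
        if self.on_timeout not in FALLBACKS:
            raise ValueError(f"unknown fallback: {self.on_timeout}")


async def answer_within_budget(request_class, generate):
    """Await `generate()` within the budget; on timeout, report the fallback.

    `generate` is any zero-argument coroutine function that returns text.
    The caller is responsible for carrying out the returned fallback status.
    """
    try:
        text = await asyncio.wait_for(
            generate(), timeout=request_class.timeout_seconds
        )
        return {"status": "complete", "text": text}
    except asyncio.TimeoutError:
        return {"status": request_class.on_timeout, "text": None}
```

Because the fallback is a field on the request class, it can be tested directly by passing a deliberately slow `generate`.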
Measurement evidence
Evidence should be collected at the same grain as the decision. If the decision is about where speed is worth more than marginal quality, the review set should contain examples that force that decision. A broad score is useful only after the team has inspected the cases that carry the most cost.
The strongest evidence combines observed user work, known edge cases, recent incidents, and synthetic pressure tests. Synthetic examples are useful when they fill a known gap. They are dangerous when they replace the real distribution the system must serve.
A good review record includes the input, the relevant context, the output, the expected answer, the judgment, and the fix. Without that record, quality work becomes memory work. With it, the team can see whether the system is learning, drifting, or merely changing shape.
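The six fields listed above map naturally onto a small record type. This is a sketch with hypothetical field names; the JSON Lines serialization is one convenient choice for keeping the history visible, not a requirement.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ReviewRecord:
    input_text: str   # what the user sent
    context: str      # relevant context available to the system
    output_text: str  # what the system returned
    expected: str     # the expected answer
    judgment: str     # reviewer verdict, e.g. "pass" or "fail"
    fix: str          # what was changed, and at which layer


def missing_fields(record: ReviewRecord) -> list:
    """Return the names of fields that are empty or whitespace only."""
    return [name for name, value in asdict(record).items() if not value.strip()]


def to_jsonl_line(record: ReviewRecord) -> str:
    """Serialize one record as a single JSON line for an append-only log."""
    return json.dumps(asdict(record), sort_keys=True)
```

Rejecting records with missing fields at write time is a simple way to keep quality work from drifting back into memory work.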
Implementation Notes
Implementation should begin with the smallest useful workflow. The first version should be narrow enough that the team can replay every important failure. If replay is not possible, the system is not observable enough for serious use.
The second version should add volume without changing the promise. This is where completed task rate by perceived wait should be watched closely. If the metric improves while support tickets, corrections, or handoffs rise, the measurement is missing something important.
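The check described above can be automated as a guardrail comparison between two periods. The metric names and the dictionary shape are assumptions for illustration; the logic simply flags the case where the headline metric rises while any counter-metric also rises.

```python
# Hypothetical counter-metrics taken from the text: higher is worse for each.
GUARDRAILS = ("support_tickets", "corrections", "handoffs")


def metric_is_suspect(before: dict, after: dict, guardrails=GUARDRAILS):
    """Return (suspect, worsened_guardrails) for two measurement periods.

    `before` and `after` map metric names to numbers and must both contain
    "completed_task_rate" plus every guardrail name.
    """
    improved = after["completed_task_rate"] > before["completed_task_rate"]
    worsened = [g for g in guardrails if after[g] > before[g]]
    return improved and bool(worsened), worsened
```

In practice the guardrails should be normalized per request before comparison, since raw counts rise with volume alone.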
The third version can expand scope only after the team knows which failures are acceptable, which failures require escalation, and which failures require rollback. Expansion without that knowledge creates a system that appears productive while quietly moving risk to the customer.
Decision Review
At the end of the chapter, the team should be able to answer four questions. What promise are we making? What evidence supports it? What happens when the promise fails? Who has authority to change the promise? These questions are simple, but they expose most weak deployments.
The answer should not live only in a meeting note. It should appear in the evaluation suite, the release checklist, the incident process, and the product experience. Users do not need to see the internal machinery, but they do need to feel its discipline.
The Latency Budget is ultimately about replacing vague confidence with accountable practice. The point is not to slow teams down. The point is to make speed repeatable, explainable, and safe enough to build a business on.
Failure Modes and Recovery operating table
| Area | What to inspect | Decision evidence |
|---|---|---|
| Metric | Completed task rate by perceived wait. | Release, rollback, or further review. |
| Sample | Routine cases, edge cases, recent failures, and new traffic. | Where speed is worth more than marginal quality. |
| Threshold | A score only matters when it changes a decision. | Budget time by user expectation before choosing a model. |
What to carry forward
- Turn known failures into observable states.
- Use completed task rate by perceived wait as the anchor metric.
- Make this decision explicit: where speed is worth more than marginal quality.
- Budget time by user expectation before choosing a model.
- Connect scores to release decisions.
- Keep fresh production examples in the review set.
