Name: Systems That Ship

The assistant failed for five different reasons that looked like one reason to users: it was unreliable. Sometimes the document was stale. Sometimes the eval set missed a case. Sometimes latency made users abandon the answer. Sometimes inference cost triggered throttling. Sometimes the permission filter removed context the model needed. Reliability was not one thing. It was the combined integrity of the path.

AI products are hardened across the whole path.

Hardening is the work that makes a scoped behavior survive operational reality. It covers data readiness, evaluation coverage, cost and latency budgets, security controls, permissions, fallback paths, and recovery. Hardening should be built into the product plan, not scheduled after "AI integration" is done.

Research spine

This chapter uses: OpenAI Evals; OWASP Top 10 for Large Language Model Applications; NIST AI Risk Management Framework; Google SRE Book; DORA, State of AI-assisted Software Development 2025.

Data hardening

Data hardening asks whether sources are current, authorized, versioned, structured, retrievable, and aligned with the product's scope. AI products that depend on stale, conflicting, or unauthorized data will fail even if the model is strong. Data contracts and freshness checks belong in the product backlog.
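A freshness and authorization check can be sketched in a few lines. This is a minimal illustration, not a prescribed design: the `SourceDoc` type, its fields, and the `usable_docs` helper are hypothetical names, and the 24-hour default is only an example value.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class SourceDoc:
    """Hypothetical record for one retrievable source document."""
    doc_id: str
    last_updated: datetime   # timezone-aware timestamp from the source system
    allowed_groups: frozenset  # groups permitted to read this document


def usable_docs(docs, user_groups, now, freshness_slo_hours=24):
    """Return only documents that are fresh enough AND readable by the user."""
    cutoff = now - timedelta(hours=freshness_slo_hours)
    return [
        d for d in docs
        if d.last_updated >= cutoff and (d.allowed_groups & user_groups)
    ]


now = datetime(2025, 1, 2, 12, 0, tzinfo=timezone.utc)
docs = [
    SourceDoc("fresh-ok", now - timedelta(hours=2), frozenset({"support"})),
    SourceDoc("stale", now - timedelta(hours=48), frozenset({"support"})),
    SourceDoc("forbidden", now - timedelta(hours=1), frozenset({"finance"})),
]
kept = usable_docs(docs, frozenset({"support"}), now)
```

Running the check before retrieval, rather than after generation, keeps stale or unauthorized text out of the prompt entirely.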

Evaluation hardening

Evals should represent normal cases, edge cases, high-risk cases, adversarial cases, and historical failures. They should be updated when the product learns. For AI features, an eval suite is not a research artifact; it is a release gate and regression safety net.
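One way to make the suite act as a release gate is to require a pass rate per slice rather than a single aggregate. The sketch below assumes eval results arrive as `(slice_name, passed)` pairs; the slice names, the `eval_gate` function, and the threshold are illustrative assumptions, not a specific framework's API.

```python
from collections import defaultdict

# Illustrative slice names based on the categories listed above.
REQUIRED_SLICES = ("normal", "edge", "high_risk", "adversarial", "historical_incident")


def eval_gate(results, pass_threshold=0.92, required_slices=REQUIRED_SLICES):
    """results: iterable of (slice_name, passed) pairs.

    Returns (ok, report) where report maps each required slice to its pass rate.
    """
    totals = defaultdict(lambda: [0, 0])  # slice -> [passed_count, total_count]
    for slice_name, passed in results:
        totals[slice_name][1] += 1
        if passed:
            totals[slice_name][0] += 1

    report = {}
    for name in required_slices:
        passed_count, total = totals.get(name, (0, 0))
        # A slice with no cases scores 0.0: missing coverage must not pass the gate.
        report[name] = passed_count / total if total else 0.0

    ok = all(rate >= pass_threshold for rate in report.values())
    return ok, report
```

Gating per slice means a strong score on normal cases cannot mask a regression on high-risk or historical-incident cases.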

Cost, latency, and security hardening

Cost and latency are product qualities. A helpful answer that arrives too late or costs too much will not survive. Security hardening includes prompt-injection defenses, permission checks, tenant isolation, output handling, logging, and escalation. These cannot be treated as independent afterthoughts because they interact. A stronger permission filter may reduce answer quality; a lower-cost model may change reliability; a longer context may improve accuracy and break latency.
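Treating cost and latency as product qualities implies measuring them the way users and finance feel them. The following sketch assumes request logs are available as dicts with `cost_usd`, `latency_ms`, and `success` keys (hypothetical field names) and uses a simple nearest-rank percentile; the budget values are examples only.

```python
import math


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


def within_budgets(requests, max_cost_per_success=0.65, p95_budget_ms=2500):
    """Summarize cost per successful task and p95 latency against budgets."""
    successes = sum(1 for r in requests if r["success"])
    total_cost = sum(r["cost_usd"] for r in requests)
    # Failed attempts still cost money, so total spend is divided by successes only.
    cost_per_success = total_cost / successes if successes else float("inf")
    latency_p95 = p95([r["latency_ms"] for r in requests])
    return {
        "cost_per_success": cost_per_success,
        "latency_p95_ms": latency_p95,
        "cost_ok": cost_per_success <= max_cost_per_success,
        "latency_ok": latency_p95 <= p95_budget_ms,
    }
```

Dividing by successful tasks rather than by requests is a deliberate choice here: retries and failed answers raise the real cost of each useful outcome.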

Operating table

Hardening area	Failure	Control
Data	Stale or unauthorized context	Freshness checks and permissions
Evals	Known failure repeats	Golden sets and regression cases
Cost	Margin collapse or throttling	Budgets, routing, caching
Latency	User abandonment	Budgeting and async design
Security	Leakage or tool misuse	Filters, isolation, least privilege
Fallback	No recovery path	Escalation and rollback

Artifact example: a hardening gate for an AI product

hardening_gate:
  data:
    freshness_slo_hours: 24
    permission_filter_required: true
  evals:
    min_cases: 250
    required_slices: ["normal", "edge", "high_risk", "historical_incident"]
    pass_threshold: 0.92
  cost:
    max_cost_per_successful_task: "$0.65"
  latency:
    p95_ms: 2500
    async_allowed_for: ["long_document_analysis"]
  security:
    prompt_injection_tests: true
    tenant_isolation_tests: true
  fallback:
    human_escalation: true
    kill_switch: true
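
A gate like this only matters if something enforces it. As a rough sketch, the check below compares measured values against a Python mirror of part of the artifact. The `measured` keys and the `check_gate` function are hypothetical, and the cost limit is assumed to have been parsed from "$0.65" into a float.

```python
# Python mirror of part of the hardening_gate artifact (cost parsed to a float).
GATE = {
    "data": {"freshness_slo_hours": 24, "permission_filter_required": True},
    "evals": {"min_cases": 250, "pass_threshold": 0.92},
    "cost": {"max_cost_per_successful_task": 0.65},
    "latency": {"p95_ms": 2500},
    "fallback": {"human_escalation": True, "kill_switch": True},
}


def check_gate(gate, measured):
    """Return a list of failure messages; an empty list means the gate passes."""
    failures = []
    if measured["oldest_source_age_hours"] > gate["data"]["freshness_slo_hours"]:
        failures.append("data: source older than freshness SLO")
    if gate["data"]["permission_filter_required"] and not measured["permission_filter_enabled"]:
        failures.append("data: permission filter disabled")
    if measured["eval_cases"] < gate["evals"]["min_cases"]:
        failures.append("evals: too few cases")
    if measured["eval_pass_rate"] < gate["evals"]["pass_threshold"]:
        failures.append("evals: pass rate below threshold")
    if measured["cost_per_successful_task"] > gate["cost"]["max_cost_per_successful_task"]:
        failures.append("cost: over budget")
    if measured["latency_p95_ms"] > gate["latency"]["p95_ms"]:
        failures.append("latency: p95 over budget")
    for flag in ("human_escalation", "kill_switch"):
        # Fallback controls must be verified, not just declared.
        if gate["fallback"][flag] and not measured.get(flag, False):
            failures.append(f"fallback: {flag} not verified")
    return failures
```

Returning every failure, instead of stopping at the first, gives the release owner the whole picture in one run.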

Figure: User request to AI output pipeline with hardening checkpoints for data, evals, cost, latency, security, and fallback. Hardening is a pipeline discipline: these checks keep failures from leaking into production.

Checklist

Harden data before scaling retrieval or generation.
Make evals a release gate.
Set cost and latency budgets as product requirements.
Test prompt injection and permission boundaries.
Design fallback before rollout.

Takeaway

Hardening is not one control; it is the integrity of the whole AI path.

Operational note: Reliability is compositional

Users experience one product, but failures may originate in data, model, retrieval, permissions, cost, latency, or operations. For the hardening work in this chapter, the practical danger is not that the team lacks effort; it is that effort is aimed at the wrong scarce resource. Raw output, the old visible unit of work, is no longer the safest unit of management. A team can produce more drafts, more code, more messages, more analysis, or more tickets while becoming less reliable at the point where the business needs a decision. The fix is to move the management surface away from raw output and toward evidence: what was decided, by whom, from which inputs, against which criteria, with what rollback path.

A mature implementation treats this as an operating-system concern rather than a personal-performance concern. The artifact should make the judgment visible: the rubric, acceptance gate, cost line, risk boundary, owner, and expiry date. When those fields are missing, the model's speed hides organizational ambiguity. When they are present, AI acceleration becomes tractable because the team can see which decisions deserve automation, which deserve human review, and which deserve rejection before execution begins.

The useful test is whether a new teammate can replay the decision two weeks later without interviewing the original author. If replay requires folklore, the process is still human-memory-bound. If replay can be done from the artifact, the team has converted judgment into infrastructure. That conversion is the recurring discipline throughout this book: not replacing human judgment, but making human judgment explicit enough that machines can safely do more of the surrounding work.
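
The replay test can be approximated mechanically by checking that a decision artifact carries every field a newcomer would need. The record below is a hypothetical shape built from the fields named in this note (rubric, acceptance gate, cost line, risk boundary, owner, expiry date, rollback path); it is a sketch, not a standard schema.

```python
from dataclasses import dataclass, fields
from datetime import date
from typing import Optional


@dataclass
class DecisionRecord:
    """Hypothetical artifact that makes one judgment replayable."""
    decision: str
    owner: str
    rubric: str
    acceptance_gate: str
    cost_line: str
    risk_boundary: str
    rollback_path: str
    expires_on: Optional[date] = None


def missing_fields(record):
    """Names of fields that are empty or unset."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


def is_replayable(record, today):
    """Replayable means: nothing missing and the decision has not expired."""
    return not missing_fields(record) and record.expires_on >= today
```

A record that fails this check still depends on someone's memory, which is the condition the replay test is meant to expose.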

Field expansion: Budgets are design tools

Cost and latency budgets force architecture decisions before scale exposes them.
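
As a small illustration of a budget forcing a design choice, the sketch below adds up per-stage latency estimates and picks a synchronous or asynchronous design. The stage names, the numbers, and the `choose_mode` helper are assumptions for the example; summing per-stage estimates is a deliberately conservative simplification, not an exact percentile calculation.

```python
def choose_mode(stage_estimates_ms, p95_budget_ms=2500):
    """Pick 'sync' if the summed stage estimates fit the budget, else 'async'.

    Returns (mode, total_ms, slowest_stage) so the team can see where time goes.
    """
    total = sum(stage_estimates_ms.values())
    slowest = max(stage_estimates_ms, key=stage_estimates_ms.get)
    mode = "sync" if total <= p95_budget_ms else "async"
    return mode, total, slowest


# Hypothetical estimates for a short question versus a long document.
short_question = {"retrieval": 300, "permission_filter": 50, "generation": 1400}
long_document = {"retrieval": 900, "permission_filter": 100, "generation": 4200}

short_plan = choose_mode(short_question)
long_plan = choose_mode(long_document)
```

Doing this arithmetic at design time shows which workloads need an async path before production traffic reveals it.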

Design consequence: Security affects product quality

Permission and safety controls change what the system can see and do, so they must be co-designed with functionality.
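
One way to co-design the two is to have the permission filter report what it removed, so the product can fall back instead of answering from thin context. The chunk shape, the `filter_context` function, and the `min_chunks` threshold below are hypothetical and only sketch the idea.

```python
def filter_context(chunks, user_groups, min_chunks=1):
    """Drop chunks the user may not read and report the effect on context.

    chunks: list of dicts with 'text' and 'allowed_groups' (a set of group names).
    """
    visible = [c for c in chunks if c["allowed_groups"] & user_groups]
    removed = len(chunks) - len(visible)
    return {
        "visible": visible,
        "removed": removed,
        # When too little context survives, the caller should escalate or decline.
        "enough_context": len(visible) >= min_chunks,
    }


chunks = [
    {"text": "Refund policy v3", "allowed_groups": {"support", "finance"}},
    {"text": "Margin by account", "allowed_groups": {"finance"}},
]
support_view = filter_context(chunks, {"support"})
legal_view = filter_context(chunks, {"legal"})
```

Surfacing `removed` and `enough_context` turns a silent quality drop into an explicit signal that the fallback path can act on.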

Harden the Path: Data, Evals, Cost, Latency, Security