Harden the Path: Data, Evals, Cost, Latency, Security
The assistant failed for five different reasons that looked like one reason to users: it was unreliable. Sometimes the document was stale. Sometimes the eval set missed a case. Sometimes latency made users abandon the answer. Sometimes inference cost triggered throttling. Sometimes the permission filter removed context the model needed. Reliability was not one thing. It was the combined integrity of the path.
AI products are hardened across the whole path.
Hardening is the work that makes a scoped behavior survive operational reality. It covers data readiness, evaluation coverage, cost and latency budgets, security controls, permissions, fallback paths, and recovery. Hardening should be built into the product plan, not scheduled after "AI integration" is done.
Research spine
This chapter uses: OpenAI Evals; OWASP Top 10 for Large Language Model Applications; NIST AI Risk Management Framework; Google SRE Book; DORA, State of AI-assisted Software Development 2025.
Data hardening
Data hardening asks whether sources are current, authorized, versioned, structured, retrievable, and aligned with the product's scope. AI products that depend on stale, conflicting, or unauthorized data will fail even if the model is strong. Data contracts and freshness checks belong in the product backlog.
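A freshness check can be small enough to run on every retrieval index build. The following is a minimal sketch, assuming each source exposes a last-updated timestamp; the function name `stale_sources`, the source IDs, and the 24-hour default are illustrative assumptions, not part of any specific tool.

```python
from datetime import datetime, timedelta, timezone


def stale_sources(last_updated, now, freshness_slo_hours=24):
    """Return the IDs of sources whose last update is older than the SLO."""
    max_age = timedelta(hours=freshness_slo_hours)
    return sorted(sid for sid, ts in last_updated.items() if now - ts > max_age)


# Hypothetical sources with fixed timestamps so the result is deterministic.
now = datetime(2025, 1, 2, 12, 0, tzinfo=timezone.utc)
last_updated = {
    "pricing_faq": now - timedelta(hours=3),
    "refund_policy": now - timedelta(hours=30),
}
print(stale_sources(last_updated, now))  # ['refund_policy']
```

A non-empty result is a backlog item for the data owner, not just a log line.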
Evaluation hardening
Evals should represent normal cases, edge cases, high-risk cases, adversarial cases, and historical failures. They should be updated whenever production reveals a failure the suite did not cover. For AI features, an eval suite is not a research artifact; it is a release gate and regression safety net.
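One way to make the suite act as a release gate is to check coverage and pass rate per slice rather than in aggregate. The sketch below is an assumption about how such a gate might look: the slice names, the `eval_gate` function, the choice to apply the threshold to every slice, and the defaults of 250 cases and 0.92 are all illustrative.

```python
from collections import defaultdict

# Slice names are hypothetical labels for the case types listed above.
REQUIRED_SLICES = ("normal", "edge", "high_risk", "adversarial", "historical_failure")


def eval_gate(results, min_cases=250, pass_threshold=0.92):
    """Return (ok, problems) for a list of (slice_name, passed) results.

    The gate fails if the suite is too small, a required slice is empty,
    or any required slice's pass rate falls below the threshold.
    """
    problems = []
    if len(results) < min_cases:
        problems.append(f"only {len(results)} cases, need {min_cases}")

    totals, passes = defaultdict(int), defaultdict(int)
    for slice_name, passed in results:
        totals[slice_name] += 1
        passes[slice_name] += int(passed)

    for name in REQUIRED_SLICES:
        if totals[name] == 0:
            problems.append(f"slice '{name}' has no cases")
            continue
        rate = passes[name] / totals[name]
        if rate < pass_threshold:
            problems.append(f"slice '{name}' pass rate {rate:.2f} < {pass_threshold}")
    return (not problems), problems
```

Checking each slice separately keeps a strong pass rate on normal cases from hiding a regression on adversarial or historical cases.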
Cost, latency, and security hardening
Cost and latency are product qualities. A helpful answer that arrives too late or costs too much will not survive. Security hardening includes prompt-injection defenses, permission checks, tenant isolation, output handling, logging, and escalation. These cannot be treated as independent afterthoughts because they interact. A stronger permission filter may reduce answer quality; a lower-cost model may change reliability; a longer context may improve accuracy and break latency.
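Treating cost and latency as product qualities means measuring them in product terms. The sketch below assumes two such measures: cost per successful task (failed attempts still cost money) and a nearest-rank p95 latency. The function names and the default budgets of $0.65 and 2500 ms are illustrative assumptions.

```python
import math


def cost_per_successful_task(total_cost_usd, successful_tasks):
    """Total spend divided by tasks that succeeded; failures inflate the figure."""
    if successful_tasks == 0:
        return math.inf
    return total_cost_usd / successful_tasks


def p95(latencies_ms):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    ordered = sorted(latencies_ms)
    rank = (95 * len(ordered) + 99) // 100  # integer ceiling of 0.95 * n, 1-based
    return ordered[rank - 1]


def within_budget(total_cost_usd, successful_tasks, latencies_ms,
                  max_cost_usd=0.65, max_p95_ms=2500):
    """Report whether measured cost and latency respect the stated budgets."""
    return {
        "cost_ok": cost_per_successful_task(total_cost_usd, successful_tasks) <= max_cost_usd,
        "latency_ok": p95(latencies_ms) <= max_p95_ms,
    }
```

Dividing by successful tasks rather than by requests is a deliberate choice in this sketch: a cheaper model that fails more often can end up costing more per useful answer.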
Operating table
| Hardening area | Failure | Control |
|---|---|---|
| Data | Stale or unauthorized context | Freshness checks and permissions |
| Evals | Known failure repeats | Golden sets and regression cases |
| Cost | Margin collapse or throttling | Budgets, routing, caching |
| Latency | User abandonment | Budgeting and async design |
| Security | Leakage or tool misuse | Filters, isolation, least privilege |
| Fallback | No recovery path | Escalation and rollback |
Artifact example: a hardening gate for an AI product
hardening_gate:
  data:
    freshness_slo_hours: 24
    permission_filter_required: true
  evals:
    min_cases: 250
    required_slices: ["normal", "edge", "high_risk", "historical_incident"]
    pass_threshold: 0.92
  cost:
    max_cost_per_successful_task: "$0.65"
  latency:
    p95_ms: 2500
    async_allowed_for: ["long_document_analysis"]
  security:
    prompt_injection_tests: true
    tenant_isolation_tests: true
  fallback:
    human_escalation: true
    kill_switch: true
Checklist
- Harden data before scaling retrieval or generation.
- Make evals a release gate.
- Set cost and latency budgets as product requirements.
- Test prompt injection and permission boundaries.
- Design fallback before rollout.
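Designing fallback before rollout can be as simple as deciding, in code, when a request leaves the AI path. The following is a minimal sketch under stated assumptions: `FallbackPolicy`, `answer_with_fallback`, the confidence score returned by the generator, and the 0.7 threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class FallbackPolicy:
    kill_switch_on: bool = False   # operators can disable the AI path entirely
    min_confidence: float = 0.7    # hypothetical threshold for human escalation


def answer_with_fallback(question: str,
                         generate: Callable[[str], Tuple[str, float]],
                         policy: FallbackPolicy) -> dict:
    """Route a request to the model, or to a human when the AI path is unsafe."""
    if policy.kill_switch_on:
        return {"route": "human", "reason": "kill_switch", "answer": None}
    try:
        answer, confidence = generate(question)
    except Exception as exc:  # any model or tool failure escalates instead of crashing
        return {"route": "human", "reason": f"error: {exc}", "answer": None}
    if confidence < policy.min_confidence:
        return {"route": "human", "reason": "low_confidence", "answer": None}
    return {"route": "ai", "reason": "ok", "answer": answer}
```

Every branch returns a route and a reason, so escalations can be counted and reviewed after the fact.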
Takeaway
Hardening is not one control; it is the integrity of the whole AI path.
Operational note: Reliability is compositional
Users experience one product, but failures may originate in data, model, retrieval, permissions, cost, latency, or operations. When hardening the path, the practical danger is not that the team lacks effort; it is that effort is aimed at the wrong scarce resource. The durable argument in AI product operations is that the old visible unit of work is no longer the safest unit of management. A team can produce more drafts, more code, more messages, more analysis, or more tickets while becoming less reliable at the point where the business needs a decision. The fix is to move the management surface away from raw output and toward evidence: what was decided, by whom, from which inputs, against which criteria, and with what rollback path.
A mature implementation treats this as an operating-system concern rather than a personal-performance concern. The artifact should make the judgment visible: the rubric, acceptance gate, cost line, risk boundary, owner, and expiry date. When those fields are missing, the model's speed hides organizational ambiguity. When they are present, AI acceleration becomes tractable because the team can see which decisions deserve automation, which deserve human review, and which deserve rejection before execution begins.
The useful test is whether a new teammate can replay the decision two weeks later without interviewing the original author. If replay requires folklore, the process is still human-memory-bound. If replay can be done from the artifact, the team has converted judgment into infrastructure. That conversion is the recurring discipline throughout this book: not replacing human judgment, but making human judgment explicit enough that machines can safely do more of the surrounding work.
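The replay test can be approximated mechanically by checking that a decision record carries the fields named above. This is a sketch only: the `DecisionRecord` type, its string fields, and the `replay_gaps` helper are hypothetical, and a real record would likely link to richer artifacts.

```python
from dataclasses import dataclass, fields
from datetime import date
from typing import Optional


@dataclass
class DecisionRecord:
    """The fields a teammate would need to replay a decision."""
    rubric: str
    acceptance_gate: str
    cost_line: str
    risk_boundary: str
    owner: str
    expiry_date: Optional[date]


def replay_gaps(record: DecisionRecord, today: date) -> list:
    """List missing fields, plus 'expired' if the record has passed its expiry date."""
    gaps = [f.name for f in fields(record) if not getattr(record, f.name)]
    if record.expiry_date and record.expiry_date < today:
        gaps.append("expired")
    return gaps
```

An empty result does not prove the decision was good; it only shows the record is complete enough to be replayed without asking the original author.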
Field expansion: Budgets are design tools
Cost and latency budgets force architecture decisions before scale exposes them. A budget stated up front turns trade-offs such as model routing, caching, and async design into explicit choices rather than incident responses.
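The claim that a budget forces design decisions can be made concrete by splitting an end-to-end latency budget across stages. The sketch below is a rough planning aid, not an exact model: the stage names and numbers are hypothetical, and summing per-stage p95 estimates only approximates the p95 of the whole request.

```python
def check_latency_budget(stage_p95_ms, total_budget_ms=2500):
    """Compare summed per-stage p95 estimates with the end-to-end budget.

    Also reports the largest stage, which is usually where a redesign
    (smaller context, caching, async handling) would start.
    """
    total = sum(stage_p95_ms.values())
    overrun = total - total_budget_ms
    return {
        "total_ms": total,
        "within_budget": overrun <= 0,
        "overrun_ms": max(overrun, 0),
        "largest_stage": max(stage_p95_ms, key=stage_p95_ms.get),
    }


# Hypothetical stage estimates for a retrieval-augmented answer.
stages = {"retrieval": 400, "permission_filter": 150, "generation": 1800, "post_processing": 300}
print(check_latency_budget(stages))
```

In this example the plan overruns the budget by 150 ms before any code ships, which is the point at which changing the design is cheapest.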
Design consequence: Security affects product quality
Permission and safety controls change what the system can see and do, so they must be co-designed with functionality. A stronger permission filter may reduce answer quality, so the filter and the answer path should be tested together.
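One way to co-design the two is to have the permission filter report how much context it removed, so the answer path can react. This is a minimal sketch under assumptions: chunks are dicts with an `allowed_groups` set, and the `filter_context` name and the 0.5 coverage threshold are hypothetical.

```python
def filter_context(chunks, user_groups, min_coverage=0.5):
    """Drop chunks the user may not read and report how much context survived."""
    kept = [c for c in chunks if c["allowed_groups"] & user_groups]
    coverage = len(kept) / len(chunks) if chunks else 0.0
    return {
        "kept": kept,
        "removed": len(chunks) - len(kept),
        "coverage": coverage,
        # Degraded context is a signal to caveat the answer or escalate.
        "degraded": coverage < min_coverage,
    }
```

The filter never relaxes permissions to improve quality; it only makes the quality cost visible so that the product can respond, for example by escalating to a human.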
