Examples Are the Executable Core
Most arguments about requirements disappear when people write examples.
Research spine: this chapter stays grounded in NIST AI Risk Management Framework and NIST Secure Software Development Framework, then applies that evidence to the operating judgment in the book.
A product manager says, "Admins should see all reports." Security says, "Only reports for their department." Customer success says, "Enterprise admins expect cross-department visibility." Engineering says, "The current role model cannot express that." Legal says, "EU tenant data cannot be visible to US-based support admins." Everyone thought they agreed until an example forced the hidden conflict into daylight.
Examples are where intent becomes testable. They are also where AI-assisted development becomes safer. A model can generate an implementation from prose, but examples give it anchors. Reviewers can debate examples more productively than abstractions. Test suites can preserve examples after the meeting ends.
Specification by Example is not new. Gojko Adzic's work popularized the practice of using concrete examples to build shared understanding and drive acceptance tests. The reason it belongs in this book is that AI makes examples more valuable. A model can transform examples into tests, fixtures, UI states, API calls, and documentation. The better the examples, the safer the generation.
A good example has five parts:
- context;
- action;
- expected behavior;
- negative or edge condition;
- why the example matters.
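The five parts can be captured as a small record so that a blank part is detectable by a tool rather than by a careful reader. This is a minimal sketch: the class name `SpecExample`, its field names, and the `missing_parts` helper are illustrative assumptions, not an established format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpecExample:
    """One example with the five parts listed above (field names are illustrative)."""
    context: str           # state of the world before the action
    action: str            # what the actor does
    expected: str          # behavior the system must show
    negative_or_edge: str  # what must not happen, or the boundary being probed
    rationale: str         # why the example matters

    def missing_parts(self) -> list[str]:
        """Return the names of any parts left blank."""
        return [name for name, value in vars(self).items() if not value.strip()]


example = SpecExample(
    context='Priya is an admin for the "Finance" department; R2 belongs to "Sales"',
    action="Priya opens the reports page",
    expected="She sees Finance reports",
    negative_or_edge="She does not see R2",
    rationale="",  # left blank on purpose to show the check
)
print(example.missing_parts())  # ['rationale']
```

A check like this could run in review tooling so that an example without a stated reason is flagged before anyone generates code from it.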
For the admin reports conflict, examples reveal the domain:

    Feature: Report visibility for tenant admins

      Scenario: Department admin sees only department reports
        Given user Priya is an admin for the "Finance" department
        And report R1 belongs to "Finance"
        And report R2 belongs to "Sales"
        When Priya opens the reports page
        Then she sees R1
        And she does not see R2

      Scenario: Enterprise admin sees all departments in same tenant
        Given user Mateo is an enterprise admin for tenant T1
        And report R1 belongs to Finance in T1
        And report R2 belongs to Sales in T1
        When Mateo opens the reports page
        Then he sees R1 and R2

      Scenario: Support admin cannot cross data residency boundary
        Given support admin Lena is based in the US
        And tenant T2 is EU-resident
        When Lena searches for reports in T2
        Then no report content is displayed
        And an access-denied audit event is recorded

These examples do more than illustrate. They constrain. They show that "admin" is not one role. They expose tenant, department, and data residency. They create fixtures. They can become automated tests. They tell an AI coding agent which conditions matter.
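To show how the first two scenarios might turn into executable checks, here is a hedged sketch. The role names (`department_admin`, `enterprise_admin`), the data classes, and the assumption that Priya belongs to tenant T1 are all invented for illustration. The function name `visible_reports` is borrowed from the traceability matrix later in this chapter. The residency scenario is left out because it also needs an audit log.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Report:
    id: str
    tenant: str
    department: str


@dataclass(frozen=True)
class User:
    name: str
    role: str  # assumed role names: "department_admin", "enterprise_admin"
    tenant: str
    department: Optional[str] = None


def visible_reports(user: User, reports: list[Report]) -> list[Report]:
    """Return the reports the user may see; unknown roles see nothing."""
    same_tenant = [r for r in reports if r.tenant == user.tenant]
    if user.role == "enterprise_admin":
        return same_tenant
    if user.role == "department_admin":
        return [r for r in same_tenant if r.department == user.department]
    return []


reports = [
    Report("R1", "T1", "Finance"),
    Report("R2", "T1", "Sales"),
    Report("R9", "T2", "Finance"),  # extra fixture (not in the scenarios): another tenant
]
priya = User("Priya", "department_admin", "T1", "Finance")  # tenant T1 is an assumption
mateo = User("Mateo", "enterprise_admin", "T1")

# Scenario 1: department admin sees only department reports.
assert [r.id for r in visible_reports(priya, reports)] == ["R1"]
# Scenario 2: enterprise admin sees all departments in the same tenant.
assert [r.id for r in visible_reports(mateo, reports)] == ["R1", "R2"]
```

Defaulting unknown roles to an empty list is a design choice in this sketch: a role the examples never mention gets no visibility until someone writes an example for it.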
Counterexamples are equally important. They define what the system must not do. AI-generated code often implements only the happy path. Counterexamples protect the boundary.
For a coupon system:
| Example type | Case | Expected result |
|---|---|---|
| Happy path | Valid coupon on eligible plan | Discount applied |
| Counterexample | Expired coupon | Rejected with reason |
| Counterexample | Coupon for annual plan used on monthly plan | Rejected |
| Edge case | Coupon applied at renewal boundary | Applied according to invoice date rule |
| Abuse case | Same coupon attempted across tenants | Rejected and audited |
| Compatibility case | Legacy coupon without campaign ID | Supported until migration date |
The counterexamples are not pessimism. They are the behavioral perimeter.
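A few rows of the coupon table can be sketched as table-driven checks. Everything named here is hypothetical: the `Coupon` fields, the reason codes, and the rule that a coupon stays valid through its expiry date. The audit step for the cross-tenant case is noted in a comment but not implemented.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Coupon:
    code: str
    tenant: str
    plan: str      # plan the coupon is valid for, e.g. "annual"
    expires: date  # assumed inclusive: valid through this date


def check_coupon(coupon: Coupon, tenant: str, plan: str, today: date) -> tuple[bool, str]:
    """Return (accepted, reason). Rejections always carry a reason."""
    if coupon.tenant != tenant:
        return False, "wrong_tenant"  # abuse case; a real system would also audit this
    if today > coupon.expires:
        return False, "expired"
    if coupon.plan != plan:
        return False, "plan_mismatch"
    return True, "ok"


coupon = Coupon("SAVE10", tenant="T1", plan="annual", expires=date(2025, 6, 30))

cases = [
    # (label, tenant, plan, today, expected)
    ("happy path",        "T1", "annual",  date(2025, 6, 1), (True, "ok")),
    ("expired",           "T1", "annual",  date(2025, 7, 1), (False, "expired")),
    ("annual on monthly", "T1", "monthly", date(2025, 6, 1), (False, "plan_mismatch")),
    ("cross-tenant",      "T2", "annual",  date(2025, 6, 1), (False, "wrong_tenant")),
]
for label, tenant, plan, today, expected in cases:
    assert check_coupon(coupon, tenant, plan, today) == expected, label
```

Three of the four cases are rejections, which mirrors the table: most of the rows describe what the system must refuse.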
Examples also support AI review. A reviewer can ask: which example does this generated code satisfy? Which counterexample does it violate? If the diff does not map to examples, the reviewer is forced back into subjective reading. Example traceability reduces review burden because it lets the reviewer test intent rather than infer it.
A traceability matrix:
| Outcome | Example | Code area | Test | Owner |
|---|---|---|---|---|
| Department admin sees only own reports | Department admin scenario | ReportPolicy.visible_reports | test_department_admin_scope | Security + Product |
| Enterprise admin sees tenant-wide reports | Enterprise admin scenario | ReportPolicy.visible_reports | test_enterprise_admin_scope | Product |
| Residency boundary holds | Support admin EU scenario | SupportAccessService | test_support_residency_denied | Legal + Security |
This matrix looks simple. It is powerful because it gives the model, reviewer, and future maintainer a map from intent to implementation.
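One way to keep such a matrix honest is to check it mechanically. The sketch below assumes the matrix is available as (example, test name) pairs and that the set of existing test names can be collected somehow. The helper `untraced_examples` and the `known_tests` inventory are hypothetical.

```python
# Rows copied from the matrix above: (example, test name).
MATRIX = [
    ("Department admin scenario", "test_department_admin_scope"),
    ("Enterprise admin scenario", "test_enterprise_admin_scope"),
    ("Support admin EU scenario", "test_support_residency_denied"),
]


def untraced_examples(matrix, known_tests):
    """Return examples whose named test is not among the known tests."""
    return [example for example, test in matrix if test not in known_tests]


# Hypothetical test inventory, e.g. collected from the test runner.
known_tests = {"test_department_admin_scope", "test_enterprise_admin_scope"}
print(untraced_examples(MATRIX, known_tests))  # ['Support admin EU scenario']
```

A non-empty result means an example has been agreed on but nothing yet enforces it.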
Examples should include data, not only prose. A fixture can be more precise than a paragraph. For a tax calculation, a table of input invoices and expected outputs may define behavior better than any description. For a data pipeline, a before/after dataset can lock transformation semantics. For a recommender, a set of queries and expected item rankings can define retrieval quality. For an AI support assistant, examples can include prompt, retrieved evidence, acceptable answer, unacceptable answer, and policy citation.
A compact fixture:

    {
      "account": {"tenant": "T1", "currency": "USD"},
      "invoice": {"subtotal_cents": 10000, "tax_region": "CA"},
      "coupon": {"type": "percent", "value": 10, "applies_to": "subtotal"},
      "expected": {
        "discount_cents": 1000,
        "taxable_subtotal_cents": 9000,
        "explanation": "Coupon applies before tax"
      }
    }

A model can generate code from this. A test can verify code against it. A reviewer can discuss whether the business rule is right. The fixture becomes a shared artifact.
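As a sketch of what "a test can verify code against it" might look like, the fixture can be loaded and compared with a candidate implementation. The function `apply_coupon` is hypothetical, it handles only the percent-of-subtotal case shown in the fixture, and flooring to whole cents is an assumption the fixture does not settle.

```python
import json

FIXTURE = json.loads("""
{
  "account": {"tenant": "T1", "currency": "USD"},
  "invoice": {"subtotal_cents": 10000, "tax_region": "CA"},
  "coupon": {"type": "percent", "value": 10, "applies_to": "subtotal"},
  "expected": {
    "discount_cents": 1000,
    "taxable_subtotal_cents": 9000,
    "explanation": "Coupon applies before tax"
  }
}
""")


def apply_coupon(invoice: dict, coupon: dict) -> dict:
    """Apply a percent coupon to the pre-tax subtotal, in integer cents."""
    if coupon["type"] != "percent" or coupon["applies_to"] != "subtotal":
        raise ValueError("only percent-of-subtotal coupons are sketched here")
    subtotal = invoice["subtotal_cents"]
    discount = subtotal * coupon["value"] // 100  # floor rounding is an assumption
    return {"discount_cents": discount, "taxable_subtotal_cents": subtotal - discount}


result = apply_coupon(FIXTURE["invoice"], FIXTURE["coupon"])
expected = FIXTURE["expected"]
assert result["discount_cents"] == expected["discount_cents"]
assert result["taxable_subtotal_cents"] == expected["taxable_subtotal_cents"]
```

The rounding comment marks a gap: the fixture uses values that divide evenly, so a second fixture with an odd subtotal would be needed to pin the rounding rule down.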
Examples should be maintained. Production incidents often reveal missing examples. If a customer finds a bug around annual contracts, do not merely patch the code. Add the example. If a reviewer catches a generated implementation that leaks cross-tenant data, add the counterexample. If a support team discovers that a workflow behaves differently for suspended accounts, add the state. The example library is the team's operational memory.
There is a risk: examples can overfit. A system can pass the listed examples and fail the underlying rule. That is why examples must be paired with constraints and properties. But examples remain the best starting point because they make the rule concrete. They are not the whole spec. They are the executable core.
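The overfitting risk can be made concrete with a deliberately bad implementation. In this sketch both functions are hypothetical. The property checked (a discount is never negative and never exceeds the subtotal) is one plausible constraint, not a rule stated by any fixture, and the input grid is arbitrary.

```python
def discount_overfit(subtotal_cents: int, percent: int) -> int:
    """Passes the single listed example but encodes no rule."""
    return 1000


def discount_rule(subtotal_cents: int, percent: int) -> int:
    """Percent of subtotal, floored to whole cents (rounding is an assumption)."""
    return subtotal_cents * percent // 100


def holds_properties(fn) -> bool:
    """Check 0 <= discount <= subtotal over a grid of inputs."""
    return all(
        0 <= fn(subtotal, percent) <= subtotal
        for subtotal in range(0, 20001, 500)
        for percent in range(0, 101, 5)
    )


# Both implementations satisfy the listed example...
assert discount_overfit(10000, 10) == 1000
assert discount_rule(10000, 10) == 1000
# ...but only one satisfies the property.
print(holds_properties(discount_overfit))  # False
print(holds_properties(discount_rule))     # True
```

The example pins one point; the property rules out whole families of wrong implementations that happen to pass through that point.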
Key Takeaways
- Most arguments about requirements disappear when people write examples.
- The practical test is whether a team can name the evidence, owner, and failure mode before it changes behavior.
- Read this with "The Spec Is the Program" and the adjacent chapters when you need the wider AI SDLC and Specs frame.
Example quality
A weak example is vague in miniature. "Given a customer, when they cancel, then cancellation works" does not help. It hides role, contract type, invoice state, payment provider, and timing. A strong example names the state that changes behavior. It includes values. It can fail.
Examples also need ownership. Product owns customer behavior. Security owns abuse cases. Platform owns performance fixtures. Support owns real-world exception cases. The spec owner coordinates, but the examples should reflect cross-functional knowledge. That is how the spec captures reality the model cannot infer.
The chapter's rule: every consequential spec should include examples before implementation begins. If the team cannot write examples, it has not yet agreed on behavior. Do not ask the model to decide for you.
Tables are often better than prose
For business rules, tables can be the cleanest example format. A pricing feature may have dozens of combinations: plan type, region, coupon, renewal status, tax treatment, customer segment. Prose becomes unreadable. A table makes coverage visible.
| Plan | Coupon | Renewal state | Expected behavior |
|---|---|---|---|
| Monthly self-serve | Percent coupon | New purchase | Apply before tax |
| Monthly self-serve | Expired coupon | New purchase | Reject with reason |
| Annual enterprise | Any self-serve coupon | Renewal | Reject and route to account team |
| Monthly self-serve | Fixed credit | Mid-cycle upgrade | Apply to prorated subtotal only |
| Suspended account | Any coupon | Reactivation | Block until billing issue resolved |
The table is not a spreadsheet for its own sake. It is a compact set of examples that can become tests. A model can generate parameterized tests from it. Reviewers can see missing cases. Product can approve behavior without reading code.
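As a rough illustration of how a table can become test cases, a pipe table can be parsed into one record per row. The helper `parse_table` is a hypothetical sketch that assumes a well-formed table with a header row, a separator row, and no escaped pipes inside cells. Only the first three rows of the table above are copied in.

```python
TABLE = """
| Plan | Coupon | Renewal state | Expected behavior |
|---|---|---|---|
| Monthly self-serve | Percent coupon | New purchase | Apply before tax |
| Monthly self-serve | Expired coupon | New purchase | Reject with reason |
| Annual enterprise | Any self-serve coupon | Renewal | Reject and route to account team |
"""


def parse_table(text: str) -> list[dict]:
    """Turn a pipe table into one dict per data row, keyed by header."""
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in text.strip().splitlines()
    ]
    header, body = rows[0], rows[2:]  # rows[1] is the |---| separator
    return [dict(zip(header, row)) for row in body]


cases = parse_table(TABLE)
print(len(cases))                     # 3
print(cases[1]["Expected behavior"])  # Reject with reason
```

A test runner's parameterization feature (for example `pytest.mark.parametrize`) could then take `cases` as its input, so adding a row to the table adds a test.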
Real examples beat invented examples
Synthetic examples are useful, but production examples are better. Support tickets, incident records, customer escalations, bug reports, failed sales promises, and reviewer disagreements should feed the example library. The best examples often come from the cases the team wishes were rare. They are the boundary where system behavior matters most.
A team can maintain a "golden examples" folder:

    /specs/billing/examples/
      cancellation-happy-path.json
      cancellation-unpaid-invoice.json
      cancellation-enterprise-contract.json
      cancellation-provider-timeout.json
      cancellation-legal-hold.json

Each file becomes test data, documentation, and model context. It also becomes a negotiation artifact. If product wants to change enterprise cancellation behavior, it changes the example first, then the implementation follows.
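A folder like this could be loaded by a few lines of code so that every file is picked up as test data automatically. In the sketch below the loader `load_examples` is hypothetical, and the file contents (the `given` and `expected` keys and their values) are invented placeholders, since the chapter does not specify the files' schema. A temporary directory stands in for the real folder so the sketch runs anywhere.

```python
import json
import tempfile
from pathlib import Path


def load_examples(folder: Path) -> dict[str, dict]:
    """Load every *.json file in the folder, keyed by file stem, in sorted order."""
    return {p.stem: json.loads(p.read_text()) for p in sorted(folder.glob("*.json"))}


# Build a throwaway folder; the contents are invented placeholders.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "cancellation-happy-path.json").write_text(
        json.dumps({"given": {"invoice_state": "paid"}, "expected": "cancelled"})
    )
    (folder / "cancellation-unpaid-invoice.json").write_text(
        json.dumps({"given": {"invoice_state": "unpaid"}, "expected": "blocked"})
    )
    examples = load_examples(folder)

print(list(examples))  # ['cancellation-happy-path', 'cancellation-unpaid-invoice']
```

Because the loader discovers files by pattern, adding a new example file after an incident requires no change to the test harness.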
Example review as cross-functional alignment
Examples should be reviewed by the people who own the consequence. Security reviews abuse cases. Support reviews customer states. Finance reviews billing examples. Product reviews user outcomes. Engineering reviews whether examples are implementable and testable. This does not require a meeting for every small change, but consequential features deserve example review before code exists.
The model can generate proposed examples, but humans must approve them. Otherwise the model may define correctness for the organization.
The Example Library
A mature AI-native team accumulates an example library the way a conventional team accumulates unit tests. The library is not a random collection of sample inputs. It is a curated institutional memory of how the product is supposed to behave. It contains happy paths, edge cases, historical regressions, abuse cases, localization cases, migration cases, and customer-specific variants. It is one of the most valuable artifacts in the repository because it turns product judgment into executable evidence.
The first category is the canonical happy path. These examples teach the model and the human reviewer what normal looks like. They should be boring and representative. The second category is the boundary path: empty carts, expired subscriptions, missing fields, large uploads, old browsers, disabled accounts, canceled invoices. Boundary examples are where generic generated code most often fails because the common path looked complete.
The third category is the regression path. Every production incident that involved behavior mismatch should leave behind at least one example. If a generated change once allowed a user to see an unauthorized document, that exact shape of request belongs in the library. If a refactor once dropped a tax field from an invoice export, that export becomes a fixture. This is how the system remembers pain.
The fourth category is the abuse path. AI-generated systems often implement the cooperative version of a feature and omit the adversarial one. The example library should include users trying to exceed limits, change another tenant's data, bypass approvals, inject instructions into free-text fields, trigger double execution, and exploit race conditions. These examples are especially important because many models are trained on tutorials and clean examples, not hostile production traffic.
The fifth category is the business-rule path. These are examples that make sense only inside the company's domain. A healthcare workflow might include state-specific consent rules. A fintech workflow might include transaction holds and reporting thresholds. A marketplace might include seller suspension states and regional tax rules. These examples are where the company's judgment becomes hard to copy.
Examples should be written in formats that tools can consume: JSON fixtures, Gherkin scenarios, snapshot tests, structured YAML cases, or compact Markdown tables that can be converted into tests. The format matters less than the discipline: examples are versioned, reviewed, owned, and expanded after incidents. A spec without examples is negotiable. A spec with examples begins to execute.
From Examples to Generation Context
The example library should not wait passively for tests to run. It should be part of generation context. When asking a model to implement behavior, provide the relevant examples in compact form: representative inputs, expected outputs, and the reason each example exists. The reason field matters because it teaches intent. "Regression from incident INC-1842" carries different weight than "normal case." "Abuse case: cross-tenant access attempt" tells the model not merely what output to produce, but what boundary the output protects.
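A compact rendering of examples for generation context might look like the sketch below. The record shape (`id`, `input`, `expected`, `reason`), the sample entries, and the helper `render_context` are assumptions for illustration. The point is only that the reason travels with each example.

```python
EXAMPLES = [
    {"id": "billing-001", "input": "percent coupon on monthly plan",
     "expected": "apply before tax", "reason": "normal case"},
    {"id": "authz-007", "input": "admin in T1 requests report from T2",
     "expected": "deny and audit", "reason": "Abuse case: cross-tenant access attempt"},
]


def render_context(examples: list[dict]) -> str:
    """Format examples as compact lines, always including the reason."""
    lines = []
    for ex in examples:
        lines.append(
            f"- [{ex['id']}] input: {ex['input']} | expected: {ex['expected']} | why: {ex['reason']}"
        )
    return "\n".join(lines)


print(render_context(EXAMPLES))
```

The rendered text would be placed in the prompt alongside the task description; the exact layout is a matter of taste as long as the reason is not dropped.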
Teams should resist dumping the whole example library into every request. Context stuffing creates noise and cost, and it can cause the model to imitate irrelevant behavior. Instead, examples should be retrieved by domain, risk, and artifact type. A billing change gets invoice, tax, refund, and subscription examples. A permission change gets authorization examples and abuse cases. A UI copy change does not need database migration fixtures.
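Selecting examples by domain and risk can be sketched as a filter followed by a priority sort. The tag vocabulary (`domain`, `risk`), the sample library, and the ordering that puts abuse and regression cases ahead of normal ones are assumptions; a real team would choose its own tags and limits.

```python
LIBRARY = [
    {"id": "invoice-tax-ca", "domain": "billing", "risk": "normal"},
    {"id": "refund-partial", "domain": "billing", "risk": "regression"},
    {"id": "cross-tenant-read", "domain": "authorization", "risk": "abuse"},
    {"id": "migration-v2-users", "domain": "database", "risk": "normal"},
]


def select_examples(library, domains, limit=10):
    """Keep examples from the requested domains, riskiest first, capped at limit."""
    priority = {"abuse": 0, "regression": 1, "normal": 2}
    matches = [ex for ex in library if ex["domain"] in domains]
    matches.sort(key=lambda ex: priority.get(ex["risk"], 3))
    return matches[:limit]


print([ex["id"] for ex in select_examples(LIBRARY, {"billing"})])
# ['refund-partial', 'invoice-tax-ca']
```

Capping the result keeps the context small, and sorting by risk first means that when the cap bites, the normal cases are the ones dropped.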
The library also supports review. A reviewer can ask: which examples prove this change behaves as intended? Which examples were added because of this change? Which historical examples still pass? A generated diff without a changed example is not always wrong, but a generated behavior change without example evidence should make the reviewer uneasy.
Over time, example coverage becomes a product asset. Competitors can copy architecture diagrams and use the same coding tools. They cannot easily copy the accumulated boundary cases produced by your customers, your incidents, your regulations, and your operational history. The example library is therefore not only a testing artifact. It is a machine-readable form of company knowledge.
