Name: Agents That Actually Work
Availability: InStock

An agent that aced the demo broke production because one tool had side effects, one response was ambiguous, and nobody could find where the plan changed.

The Demo That Lied is the agent failure pattern where a clean demo hides side effects, ambiguous instructions, and missing traceability until production exposes them.

Key Takeaways

Agent demos fail when they prove happy-path fluency but ignore side effects, ambiguity, and replay evidence.

A production agent needs bounded goals, tool contracts, observable state, narrow autonomy, and designed de-escalation before launch.

Benchmarks and demos are useful only when they predict the workflow, permissions, and failure modes the real system will face.

The book treats autonomy as a liability that must earn its place, not a feature to grant by default.

Read this beside what an agent is and is not, the BOUND Test, and how to build AI agents when you turn the chapter into a production design.

The demo was beautiful. A support engineer typed one sentence into a chat box: "Customer 48213 says their invoice is wrong, fix it." The agent read the ticket, queried the billing API, found a duplicate line item, issued a credit, drafted an apology email, and closed the ticket. Eleven seconds. The room clapped. Someone said the word "headcount" out loud, which is how you know a demo landed.

We shipped it to a controlled pilot four weeks later. On the third day it issued a $14,000 credit to an enterprise account that had not been double-billed. The customer was delighted and silent. Finance was neither.

When we pulled the thread, three things had gone wrong at once, and not one of them was the model "being dumb." First, the billing API's applyCredit endpoint had side effects we never classified: it wrote to the ledger, triggered a webhook to the dunning system, and was not idempotent, so a retry doubled the amount. Second, the API had returned a field called disputed_amount that was null for this account, and null serialized into the prompt as the string "null," which the model read as a number near zero, then ignored, then substituted the full invoice total because that was the only dollar figure left in context. Third, and this is the one that ended the meeting, nobody could tell us where the plan changed. We had logs. We did not have a trace. We had the final action and the first instruction and a thirty-thousand-token black hole in between.

That black hole is the subject of this book.

The enemy

There is a genre of writing about AI agents that treats autonomy as a feature you add, like dark mode. Give the model more tools, a longer leash, a bigger context window, and a loop, and it becomes an "autonomous worker." The pitch is always framed as a ceiling raise: look how much it can do now. The pitch almost never mentions the floor: what happens when it is wrong, which it will be, in production, against real systems, with money and reputations attached.

I have sold software, run revenue teams, debugged incidents at 2 a.m., and sat in the meeting where someone explains to a customer why an automated system did something irreversible. So I will be blunt about the enemy this book is fighting: agent hype that treats autonomy as inherently valuable and ignores tools, state, permissions, traceability, cost, and failure paths.

Autonomy is not a feature. It is a liability that must earn its place. That sentence is the motif of the entire book, and I will repeat it until it is annoying, because the industry has spent three years repeating the opposite.

Infographic map for Introduction: The Demo That Lied — The figure maps the incident pattern behind this introduction: an agent that aced the demo broke production because one tool had side effects, one response was ambiguous, and nobody could find where the plan changed.

The thesis

Here is the claim, stated plainly so you can argue with it:

Agents work best when autonomy is narrow, tools are reliable, state is observable, and failure has a designed landing zone.

Read that again and notice what it does not say. It does not say agents are good or bad. It does not say more autonomy is better or worse. It says autonomy has preconditions, and when those preconditions are absent, the autonomy you added is not capability. It is exposure.

The $14,000 credit was not an intelligence failure. The model reasoned fine given what it saw. It was a systems failure: an unclassified side effect, an unobservable state transition, an un-budgeted retry, and a missing landing zone for the case where the agent is confidently wrong. Every one of those is something a team controls. None of them is something you fix by waiting for a smarter model.

This is good news, actually. It means the question "will agents work?" is the wrong question. The right question is "have you built the conditions under which this particular agent can be trusted with this particular task?" That question has answers. This book is a set of them.

What this book is

This is a field manual. It is built to be implementation-ready, not inspirational. You will find task contracts, tool-permission schemas, state and checkpoint patterns, retry and budget controls, evaluation harnesses for agent runs, trace schemas, escalation paths, kill-switch runbooks, and failure taxonomies. Where pseudocode or config makes the idea concrete, you get pseudocode or config. Where a decision table beats a paragraph, you get the table.

It is organized around four reusable frameworks that show up in every chapter:

The BOUND Test, five gates that decide whether a task deserves an agent at all: Bounded goal, Observable state, Usable tools, Narrow autonomy, Designed failure.
The Autonomy Ladder, six rungs from script to autonomous agent, with promotion criteria so you climb on evidence rather than enthusiasm.
The Tool Trust Contract, the seven properties every tool must declare before an agent is allowed to call it: permission, input schema, side effects, idempotency, rollback, rate limits, audit logging.
The Agent Incident Chain, the five-link path by which agent failures actually propagate: bad instruction, bad plan, wrong tool, irreversible action, hidden failure, with a control at every link.

These are not acronyms chosen because they spell something. They are checklists I wish I had owned before the demo that lied.

What this book is not

I want to disappoint the right readers early, so nobody finishes annoyed.

This is not a list of agent frameworks. LangGraph, AutoGen, CrewAI, the SDK your vendor shipped last month: I will reference them where their behavior teaches something, but the half-life of a framework comparison is about a quarter, and the principles outlive the libraries. If you came for a bake-off, this is the wrong book.

This is not a thirty-minute "build an autonomous worker" tutorial. Those tutorials work because they run once, in a clean sandbox, against a toy. Production is the opposite of all three conditions.

This is not a prompt cookbook. Prompts matter, but a clever prompt wrapped around an unclassified side effect is a faster way to lose $14,000.

This is not a philosophical book about AGI. I do not know when or whether machines will think. I know that next Tuesday someone on your team will decide whether a support workflow gets an agent, and I want them to decide well.

And this is not a claim that agents replace workflows everywhere. Most things that look like agent problems are workflow problems wearing a costume. A large part of this manual is about telling the difference, because the most expensive agent is the one you built for a job a state machine would have done deterministically, cheaply, and without a $14,000 surprise.

Who should read it

Engineers building tool-using LLM systems, who need schemas and patterns rather than vibes. Product managers deciding whether a given workflow needs an agent, who need a decision tree they can defend in a planning meeting. Automation teams replacing brittle scripts, who already know what brittle feels like and do not want to trade it for unpredictable. Support, sales, finance, legal, and operations leaders piloting agents, who carry the blast radius. Platform and MLOps teams responsible for traces, evals, and incidents, who will own the black hole when it appears. And founders building agentic products, who need to know which promises are safe to make.

If you are technical, the artifacts will be usable as-is. If you are not, the frameworks will give you the right questions to ask the people who are, and the right answers to refuse.

How to read it

The chapters are sequenced, but they are not a story with a twist, so read what you need. If you are deciding whether to build an agent at all, start with the BOUND Test and the workflow-versus-agent comparison. If you have already shipped something and it scares you, jump to observability, incidents, and kill switches, then come back and do the BOUND Test you skipped. If you are an engineer, the tool, state, planning, and security chapters are where the load-bearing artifacts live.

Throughout, I label hypotheticals as hypothetical and cite primary sources inline where a claim depends on research or public engineering practice. When I say ReAct established the interleaved reason-then-act loop most agents still use, I link the paper. When I say excessive agency is a named, current security risk, I link the OWASP Top 10 for LLM Applications. I will also be honest about where the benchmarks are weaker than their headlines, because an evals chapter that takes leaderboards at face value would be malpractice.

The promise

By the end, you will be able to do three things you probably cannot do cleanly today.

You will be able to decide, with a worksheet rather than a hunch, when a task deserves an agent and when it deserves a workflow, a copilot, or a plain script.

You will be able to design bounded autonomy that survives contact with production: tools that declare their side effects, state you can observe and replay, plans that cannot wander forever, failures that land somewhere safe, and a switch you can actually flip.

And you will be able to tell a real incident story afterward, because you will have the trace, the plan, the tool calls, and the exact link in the chain where it went wrong.

The agent in the demo looked smart right up until it touched a real system. The agents that actually work are not the ones that look smartest. They are the ones that are the most accountable. We start by being precise about what an agent even is, because most of the things people call agents are not, and that confusion is where the trouble begins.

Introduction: The Demo That Lied