Agentic Design Patterns That Actually Work

The agentic design patterns that survive production are the bounded ones: tool-use with guardrails, plan-then-execute with checkpoints, reflection scored by evals, and human-in-the-loop only at named decisions. Everything else is a demo waiting to embarrass you. Between the impressive prototype and the quiet production failure sits a thin set of patterns that hold, and they all share one trait: they constrain the agent more than they free it.

I have spent the last two years putting agents into real operational workflows, where a bad output lands on a real customer, not a slide. This piece sits one level below the pillar field guide on AI agents and agentic workflows, and I wrote the broader case in an honest accounting of what agents can do today. What follows is narrower: the reusable patterns I trust, the failure mode each one fixes, and the cost you pay to run it.

A pattern earns its place by closing a specific failure mode, not by making the architecture diagram look sophisticated.

Key takeaways

If you read nothing else, take these five claims with you:

Tool-use with guardrails is the base pattern. Give the agent exactly the tools it needs, least privilege, and validate every call.
Plan-then-execute with checkpoints beats open-ended autonomy when the steps are knowable. You get predictability and a place to stop.
Reflection only pays when a machine scores it. Self-critique without an eval is the agent grading its own homework.
Human-in-the-loop belongs at named decisions, not "everywhere," or it collapses into rubber-stamping.
Multi-agent autonomy is the pattern to reach for last. It can stretch latency from under a second to tens of seconds and adds a failure surface most teams cannot debug.

Workflows first, agents only when you must

The most useful distinction in agentic design is not a pattern at all. It is the line between a workflow and an agent. A workflow orchestrates LLM calls and tools through code paths you wrote. An agent lets the model decide its own next action at runtime. Anthropic draws this line cleanly in Building Effective Agents, and their advice is blunt: most teams should use composable workflow patterns, and reach for full autonomy only when the task genuinely needs it.

This matters because autonomy is the expensive ingredient. The moment the model picks its own next step, you lose the predictability that makes traditional software debuggable. So the first agentic design pattern is a decision, not code: can I express this as a fixed sequence of steps? If yes, write the workflow. You will ship something that holds at 3am instead of something that demos well at 2pm. If you want a team that has drawn this line in production before, that is everyday work for the engineers you hire to build AI systems.

When it works: any task where the steps are knowable in advance. Classification, extraction, routing, summarize-then-act. The failure mode: teams reach for an autonomous loop because it feels modern, then spend a quarter debugging behavior they could have hard-coded in an afternoon. Concrete example: a "research agent" that fetches three known sources, summarizes each, and merges them is a workflow. Calling it an agent does not make it one, and pretending otherwise just adds tokens.
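The three-source example above can be sketched as plain code. This is a minimal illustration, not a reference implementation: `fetch`, `summarize`, and the source names are hypothetical stand-ins, and a real `summarize` would be a single LLM call. The control flow lives in the code you wrote, and the model never picks the next step.

```python
from typing import List

# Hypothetical stand-in: a real fetch() would read from a known, fixed source.
def fetch(source: str) -> str:
    documents = {
        "pricing": "Pricing changed in March. Enterprise tier added.",
        "changelog": "A new version shipped. It fixed a login bug.",
        "status": "No incidents this week. Uptime was normal.",
    }
    return documents[source]

# Hypothetical stand-in for one LLM call: keep only the first sentence.
def summarize(text: str) -> str:
    return text.split(".")[0].strip() + "."

def research_workflow(sources: List[str]) -> str:
    # Fixed sequence written in code: fetch, summarize, merge.
    summaries = [summarize(fetch(source)) for source in sources]
    return "\n".join(f"- {summary}" for summary in summaries)

report = research_workflow(["pricing", "changelog", "status"])
print(report)
```

Nothing in this sketch loops on a model decision, so a failure has one place to be: the step that raised.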

Tool-use with guardrails: the base pattern

Tool-use is the foundation every other pattern builds on. The agent calls a function, reads the result, and decides what to do next. It appears in nearly every production system because it is how the model touches the world. The pattern that survives is tool-use with guardrails: least-privilege tool scopes, schema validation on every call, and a hard cap on iterations.

The guardrails are the whole point. An agent that summarizes documents should not hold a write key to the document store. An agent that drafts replies should not hold a send key. This is the principle of least privilege applied to a system that will, eventually, do something you did not predict. You are not preventing the surprise. You are shrinking its blast radius.

When it works: the task resolves in a single LLM call plus a few well-defined tools, with each tool's output checkable against a schema. The failure mode: the model hallucinates a tool result, or invents an argument the API never accepts, and proceeds as if the fiction were real. Concrete example: a fintech lead I worked with, Priya, ran a fraud-flagging agent that read a transaction, called a read-only risk API, and wrote a flag to a review queue, never touching the ledger. In its first month it mis-scored roughly 40 transactions out of 12,000, but because every flag landed in a human-reviewed queue and the agent held no write key, not one of those errors moved a cent. The blast radius was a reviewer's extra two minutes, not a refund.

    # Guardrail, not vibes: validate before you trust the tool call
    if tool_call.name not in ALLOWED_TOOLS:
        raise ToolScopeError(tool_call.name)
    result = run(tool_call)
    if not schema.validate(result):
        escalate(reason="tool output failed schema")
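A slightly fuller, runnable sketch of the same guardrails follows, covering the three named earlier: a least-privilege allowlist, argument validation on every call, and a hard cap on iterations. Everything here is an assumption for illustration: `get_risk_score`, `ToolArgumentError`, `MAX_ITERATIONS`, and the shape of the tool-call dictionaries are hypothetical, and the list of calls stands in for what a model would request turn by turn.

```python
from typing import Any, Dict, List

class ToolScopeError(Exception):
    """Raised when the model asks for a tool outside its allowlist."""

class ToolArgumentError(Exception):
    """Raised when tool arguments do not match the declared argument types."""

def get_risk_score(transaction_id: str) -> Dict[str, Any]:
    # Hypothetical read-only tool; a real one would call a risk API.
    return {"transaction_id": transaction_id, "score": 0.12}

# Least privilege: only read-only tools are registered, each with argument types.
ALLOWED_TOOLS: Dict[str, Dict[str, Any]] = {
    "get_risk_score": {"fn": get_risk_score, "args": {"transaction_id": str}},
}
MAX_ITERATIONS = 3  # hard cap on tool calls per task

def run_tool(name: str, args: Dict[str, Any]) -> Dict[str, Any]:
    if name not in ALLOWED_TOOLS:
        raise ToolScopeError(name)
    spec = ALLOWED_TOOLS[name]
    # Reject invented or missing arguments before anything runs.
    if set(args) != set(spec["args"]):
        raise ToolArgumentError(f"unexpected arguments for {name}: {sorted(args)}")
    for key, expected_type in spec["args"].items():
        if not isinstance(args[key], expected_type):
            raise ToolArgumentError(f"{key} must be {expected_type.__name__}")
    return spec["fn"](**args)

def run_agent(tool_calls: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    # tool_calls stands in for the model's requests, one per turn.
    results = []
    for turn, call in enumerate(tool_calls):
        if turn >= MAX_ITERATIONS:
            break  # stop at the cap instead of looping on
        results.append(run_tool(call["name"], call["args"]))
    return results
```

A request for an unregistered tool such as `write_ledger` raises `ToolScopeError` before any function runs, which is the point: the surprise still happens, but it cannot reach anything the agent was never given.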

Plan-then-execute with checkpoints

When a task has multiple steps with dependencies, plan-then-execute beats open-ended reasoning. The agent produces an explicit plan first, then executes it step by step, pausing at checkpoints where the cost of a wrong step is high. The plan is inspectable before any action runs, which is the property that makes the pattern safe.

The classic open-ended version is ReAct, where the model interleaves reasoning and acting in a loop, introduced in the ReAct paper in 2022. ReAct is powerful and worth knowing. In production I bound it: an explicit iteration limit, a written plan the agent commits to, and checkpoints where execution stops for a check before continuing. Unbounded loops are how agents quietly burn tokens and drift off-goal.

When it works: multi-step tasks where the decomposition is knowable up front, like a data migration or a multi-stage report. The failure mode: without checkpoints, the agent commits step three before anyone notices step two was wrong, and now the error has propagated. Concrete example: a billing-reconciliation agent plans five steps, executes the first two against a staging copy, and stops at a checkpoint before any step that writes to the production ledger. The reversible steps run free. The irreversible one waits.
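The billing-reconciliation example can be sketched as a plan of steps, each flagged as reversible or not, with the executor pausing before anything irreversible. This is an assumption-laden illustration: the `Step` type, the step names, and the `approve` callback are hypothetical, and a real executor would run actual work where this one only logs.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Step:
    name: str
    reversible: bool  # irreversible steps are checkpoints

def execute_plan(plan: List[Step], approve: Callable[[Step], bool]) -> List[str]:
    """Run steps in order; stop at the first irreversible step that is not approved."""
    log = []
    for step in plan:
        if not step.reversible and not approve(step):
            log.append(f"stopped before {step.name}")
            break
        log.append(f"ran {step.name}")
    return log

# The whole plan exists, and can be inspected, before any step runs.
plan = [
    Step("load staging copy", reversible=True),
    Step("match invoices to payments", reversible=True),
    Step("write adjustments to production ledger", reversible=False),
    Step("archive reconciliation file", reversible=True),
    Step("send summary report", reversible=True),
]

log = execute_plan(plan, approve=lambda step: False)
print(log)
```

With no approval given, the first two steps run and execution halts at the ledger write, so a mistake in step two is caught while it is still cheap to undo.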

Reflection, but only with evals

Reflection is the pattern where the agent critiques its own output and revises it. Done right, it reduces hallucinations and improves reasoning depth. Done wrong, it is the agent grading its own homework and giving itself an A. The difference is whether a machine, not the model, scores the work.

My rule is simple. Reflection only counts when the critique is anchored to something external the agent cannot talk its way past: a test suite that runs, a schema that validates, a retrieval check against ground truth, an eval harness that returns a number. A code agent that writes a function, runs the tests, reads the failures, and fixes them is using reflection correctly, because the tests are the judge. A writing agent that "reviews its tone" and declares itself satisfied is using reflection as theater. This is the same discipline I argue for in the guide to evaluating AI agents: a human or a model saying "looks good" is not a measurement.

When it works: any output you can verify mechanically, where a failing check gives the model real signal to revise against. The failure mode: reflection without an external scorer inflates confidence while changing nothing, and a Reflexion loop that runs ten cycles can burn 50x the tokens of a single pass for no measurable gain. Concrete example: an extraction agent pulls fields from an invoice, validates them against a schema and a known total, and re-extracts only the fields that fail. The check decides when it is done, not the model's opinion of itself.
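A minimal sketch of that invoice loop, assuming a two-field schema and a known total supplied from outside the model. The names `SCHEMA`, `MAX_CYCLES`, `failing_fields`, and `stub_extract` are hypothetical; the stub simulates a model that returns the total as a string on the first pass and gets it right on the second.

```python
from typing import Callable, Dict, List

# Hypothetical schema: field name -> required Python type.
SCHEMA = {"invoice_id": str, "total": float}
MAX_CYCLES = 3  # hard cap so the loop cannot burn tokens indefinitely

Extractor = Callable[[str, List[str]], Dict[str, object]]

def failing_fields(fields: Dict[str, object], known_total: float) -> List[str]:
    """External scorer: a type check per field plus a check against the known total."""
    bad = [name for name, kind in SCHEMA.items() if not isinstance(fields.get(name), kind)]
    if "total" not in bad and abs(fields["total"] - known_total) > 0.005:
        bad.append("total")
    return bad

def extract_with_checks(extract: Extractor, document: str, known_total: float) -> Dict[str, object]:
    fields = extract(document, list(SCHEMA))
    bad = failing_fields(fields, known_total)
    cycles = 0
    while bad and cycles < MAX_CYCLES:
        # Re-extract only the fields that failed the check.
        fields.update(extract(document, bad))
        bad = failing_fields(fields, known_total)
        cycles += 1
    if bad:
        raise ValueError(f"fields still failing after {MAX_CYCLES} cycles: {bad}")
    return fields

calls: List[List[str]] = []

def stub_extract(document: str, wanted: List[str]) -> Dict[str, object]:
    # Stand-in for an LLM extraction call; records which fields were requested.
    calls.append(list(wanted))
    first_pass = {"invoice_id": "INV-7", "total": "118"}  # wrong type for total
    later_pass = {"invoice_id": "INV-7", "total": 118.0}
    source = first_pass if len(calls) == 1 else later_pass
    return {name: source[name] for name in wanted}

fields = extract_with_checks(stub_extract, "invoice text", known_total=118.0)
print(fields, calls)
```

The loop ends when `failing_fields` returns an empty list or the cap is hit; at no point is the model asked whether it is satisfied.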

Reflection without an external scorer is the agent grading its own homework. Vibes are not evals, and self-critique is not verification.

Human-in-the-loop, only at named decisions

"Put a human in the loop" feels safe and scales terribly. The reviewer becomes a bottleneck, then a rubber stamp, then a liability. I take the full version of this argument further in my book Human in the Loop Is Not a Plan. The pattern that survives is narrow: a human is in the loop at named decisions, not everywhere.

A named decision is a specific, written checkpoint where the agent must stop and get a human verdict. Anything irreversible. Anything that touches money, a legal commitment, or a customer relationship. Everything else the agent handles, and routes only the named cases to a person with enough context to act fast. "A human reviews everything" produces a reviewer who reads the first line and approves on autopilot. "A human approves any refund over $500" produces oversight that actually happens.

When it works: high-stakes, low-frequency decisions where human judgment adds more than it costs. The failure mode: blanket review collapses into rubber-stamping under load, and the institutional weight of "human-approved" gets attached to outputs no human meaningfully read. Concrete example: a support team I advised, run by a lead named Daniel, let its agent draft and send routine replies on its own, but any message that mentioned a refund, a legal threat, or a churn risk stopped for a named human approval before it went out. The result was a clean split: the agent handled about 2,000 easy tickets a day, and Daniel's team reviewed only the 20 hard ones that actually needed judgment. The oversight that survived was the oversight that fit a human's day.
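Named decisions work best when they are written down as rules the code can check. The sketch below is one hypothetical way to do that: the `Draft` type, the keyword lists, and the route names are assumptions for illustration, and the $500 threshold is the refund example from above. Real detection of legal threats or churn risk would need more than keyword matching.

```python
from dataclasses import dataclass
from typing import List

REFUND_APPROVAL_THRESHOLD = 500.00  # "any refund over $500"
LEGAL_TERMS = ("lawyer", "lawsuit", "legal action")   # hypothetical keyword list
CHURN_TERMS = ("cancel my account", "switching to")   # hypothetical keyword list

@dataclass
class Draft:
    text: str
    refund_amount: float = 0.0

def named_decisions(draft: Draft) -> List[str]:
    """Return the written rules this draft trips; an empty list means none."""
    reasons = []
    if draft.refund_amount > REFUND_APPROVAL_THRESHOLD:
        reasons.append("refund over threshold")
    lowered = draft.text.lower()
    if any(term in lowered for term in LEGAL_TERMS):
        reasons.append("legal threat")
    if any(term in lowered for term in CHURN_TERMS):
        reasons.append("churn risk")
    return reasons

def route(draft: Draft) -> str:
    # Only drafts that trip a named decision wait for a person.
    return "human_review" if named_decisions(draft) else "auto_send"

print(route(Draft("Your password has been reset.")))
print(route(Draft("Refund issued.", refund_amount=750.0)))
```

Returning the reasons alongside the route gives the reviewer the context to act fast, and gives you a log of which rule fired and how often.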

When NOT to use an agent

The most valuable pattern is knowing when to use no agent at all. If the task is fully specifiable, a workflow or plain code is cheaper, faster, and easier to debug. If the action is irreversible and high-stakes, do not hand it to an autonomous loop. If you cannot write a check for "good output," you cannot evaluate the agent, and an agent you cannot evaluate is a liability you have not measured yet.

Multi-agent systems deserve special skepticism. Orchestrating several specialized agents looks elegant on a whiteboard and behaves badly in production. The hidden economics of agents are unforgiving: a single LLM call that returns in about 800 milliseconds can balloon to 10 to 30 seconds once you wrap it in a multi-agent orchestrator with a reflection loop. The same analysis shows a ten-cycle Reflexion loop can burn 50 times the tokens of one linear pass, and an unconstrained coding agent can run $5 to $8 per task.

That cost is a revenue decision, not a footnote. An agent that burns even 50 cents per task and runs a million times a month is a $500k monthly line item, roughly $6 million a year, before it earns a dollar. The narrow band where autonomy earns its keep is defined as much by unit economics as by capability. Reach for the simplest pattern that closes the failure mode, and reach for multi-agent autonomy last, if at all.
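The unit economics are worth computing before the architecture is chosen. A quick check using the 50 cents per task and one million runs a month from the paragraph above; the function name is a hypothetical helper.

```python
def monthly_cost(cost_per_task: float, tasks_per_month: int) -> float:
    """Spend per month for a given per-task cost and monthly volume."""
    return cost_per_task * tasks_per_month

monthly = monthly_cost(0.50, 1_000_000)
annual = monthly * 12

print(f"monthly: ${monthly:,.0f}")  # monthly: $500,000
print(f"annual: ${annual:,.0f}")    # annual: $6,000,000
```

Swapping in your own per-task cost and volume takes seconds and tells you whether a pattern's extra tokens are affordable at your scale.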

Composing the patterns

Real systems are not one pattern. They compose two or three, each fixing a different failure. A production agent I trust looks like this: a bounded tool-use core, wrapped in a plan-then-execute structure with checkpoints, with reflection scored by evals on the verifiable steps, and human-in-the-loop at the two or three decisions that are irreversible. Tool-use gives it reach. Planning gives it predictability. Reflection gives it self-correction with a real judge. The human handles the edges no system should own.

The order matters. Add patterns in response to failure modes you have observed, not requirements you imagine. Every pattern you add costs tokens, latency, and debugging surface. The discipline is to add only what closes a real gap, then stop. This is the same loop I describe in the AI-Native thesis: the machine does the work, and the human's job narrows to judgment, specifying what the agent may do and evaluating what it did. The patterns are how you make that judgment enforceable in code. The deeper version of this framework, with production examples and the failure modes behind each pattern, is in my book Agents That Actually Work.

Frequently asked questions

What are the main agentic design patterns?

The patterns that hold in production are tool-use with guardrails, plan-then-execute with checkpoints, reflection scored by evals, and human-in-the-loop at named decisions. Multi-agent orchestration is a fifth, used last because it multiplies cost and latency. Most reliable agents compose two or three of these, not all five.

How do I build a reliable AI agent?

Start with the simplest pattern that closes your failure mode, usually bounded tool-use. Give the agent least-privilege tools, validate every call against a schema, cap its iterations, and add an external eval before you trust its self-critique. Add planning and human checkpoints only where an observed failure justifies them. The step-by-step version lives in the guide to how to build AI agents.

When should I not use an agent?

When the task is fully specifiable, write a workflow or plain code instead, because it is cheaper, faster, and debuggable. Avoid autonomous agents for irreversible, high-stakes actions, and avoid any agent whose output you cannot check mechanically. If you cannot write an eval for it, you cannot trust it in production.

What do agentic workflows actually cost to run?

More than the demo suggests. A single call that returns in under a second can stretch to tens of seconds once it is wrapped in a multi-agent orchestrator with a reflection loop, and a long Reflexion loop can burn many times the tokens of one pass. Treat cost and latency as design constraints from day one, not problems to fix after launch.

If you are building agents and want a team that ships them with guardrails and evals from day one, that is the work Devlyn does on AI observability and monitoring. The patterns above are the starting point. The discipline to apply only the ones a real failure mode demands is what separates an agent that holds from one that quietly breaks.