AI Agents and Agentic Workflows: An Honest Field Guide

Agentic workflows let AI agents take actions toward a goal in a loop. They earn their keep in a narrow band - here is exactly where, and where they fail.

An AI agent is a system where a language model is given a goal, a set of tools, and the freedom to decide its own next action based on what it observes. An agentic workflow is that loop running toward an outcome: the model plans, calls a tool, reads the result, adjusts, and repeats until it stops. The difference from a chatbot is simple. A chatbot answers. An agent acts.

That distinction is the whole story. Generation produces text you read. Agentic workflows produce actions with consequences. And actions are where the trouble lives. I have spent two years putting these systems into production, and the honest truth is this: agents earn their keep in a narrow band of tasks, and that band is smaller than the demos suggest.

This is the field guide I wish someone had handed me before I shipped my first agent. It defines the terms, draws the line between agentic and generative AI, names the failure modes nobody markets, and walks the design patterns that actually hold at 3am. I write it from the seat where engineering meets revenue, because the question "should this be an agent?" is always both a technical and a P&L question.

A chatbot answers. An agent acts. Everything hard about agentic workflows comes from that one word: act.

Key takeaways

If you read nothing else, read these.

Agentic workflows are loops, not prompts. The model decides its next action, calls tools, and iterates toward a goal. Autonomy is the point and the problem.
Reliability compounds downward. A 1% per-step error rate becomes a 63% chance of failure across 100 steps. Chain five agents at 95% each and end-to-end success drops to about 77%.
The narrow band is bounded, reversible, verifiable, and tool-scoped. Remove any one property and the agent fails quietly.
"A human reviews it" is not a plan. You evaluate an agentic workflow with an eval harness on the trajectory, not just the final answer.
Agents trade latency and cost for autonomy. Sometimes that trade pays. Often a single well-prompted model call is cheaper and more reliable.

What are AI agents and agentic workflows?

An AI agent is software that pursues a goal by taking actions in a loop, using a language model to decide each step. Agentic workflows are the orchestrated runs of that loop across tools, data, and time. The model is the reasoning engine. The tools are how it touches the world: APIs, databases, file systems, other models.

The word "agent" has been stretched past meaning by marketing. So I use a strict test. If the system follows a fixed code path that a developer wrote, it is a workflow. If the model itself chooses what to do next at runtime, it is an agent. Anthropic draws the same line in its guide to building effective agents: workflows orchestrate models through predefined paths; agents let the model direct its own process. Most production systems are workflows with a thin agentic layer, and that is usually the right design.

The reason the distinction matters is control. A workflow fails the way ordinary software fails: a missing field, a timeout, an unhandled exception. An agent fails in ways that surprise you. It pursues the wrong goal with confidence. It loops. It hallucinates a tool output and proceeds as if it were real. Understanding that failure surface is the difference between a useful deployment and one that breaks in front of a customer. It is exactly the gap a team that has shipped agents in production knows how to close, which is what you get when you hire AI engineers who have done it before. I made the broader case for this in my essay on an honest accounting of what agents can do today. The principles of building AI agents follow directly from that failure surface.

Agentic AI vs generative AI: output vs action

Generative AI produces an artifact: a paragraph, an image, a function. You read it, judge it, and decide what to do. Agentic AI takes the action itself. That single shift, from producing output to executing action, is what introduces risk that generation never had.

When a model drafts an email, the worst case is a bad draft you discard. When an agent sends the email, the worst case is a sent email you cannot recall. Generation is reversible by default because nothing happens until a human acts on it. Agentic workflows close that gap, and in closing it, they remove the human checkpoint that quietly caught most errors.

Generation gives you a draft you can throw away. An agent already pressed send. The risk is not in the model - it is in the action you let the model take.

This is why I treat "agentic" and "generative" as different engineering problems, not points on a spectrum. Generative systems need good prompts and good evals on output quality. Agentic systems need all of that plus guardrails on actions, observability on reasoning, and a recovery plan for when the loop goes wrong. The fuller comparison, with the specific failure modes, sits in the supporting piece on agentic AI versus generative AI. For this guide, hold onto one idea: action carries consequence, and consequence is what you design around.

When agents actually work, and when they don't

Agents earn their keep on tasks with four properties: bounded, reversible, verifiable, and tool-scoped. This is the honest core of the guide, because nobody selling an agent platform will tell you where their product fails.

Bounded means a clear start and stop. "Triage these 200 tickets and flag refund cases for review" is bounded. "Run our support queue" is not. Reversible means a mistake can be undone fast. Drafting is reversible; sending and charging are not. Verifiable means you can check the output mechanically, not by squinting at it. A test suite passes or fails. A schema validates or does not. Tool-scoped means least privilege: the agent holds exactly the tools it needs and nothing more. No send key, no write access to the primary database, nothing whose blast radius you cannot afford.

Remove any one of those and you leave the band. An agent that is bounded, verifiable, and tool-scoped but not reversible is one bad step from a real incident. The math is unforgiving here. Carnegie Mellon's TheAgentCompany benchmark found the best model completed only 24% of realistic multi-step office tasks autonomously, with failure rates near 70% as complexity rose. Production data tells the same story: an analysis of thousands of deployed agents reported a 56.6% success rate across millions of real runs.

The reason is compounding. Reliability multiplies down a chain. A 1% error per step becomes a 63% chance of at least one failure across 100 steps. Wire five agents together at 95% reliability each and your end-to-end success falls to roughly 77%, as the multi-agent reliability math shows plainly. Longer chains are not more capable. They are more fragile.
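
The arithmetic behind those two figures fits in a few lines. This is a minimal sketch that assumes every step fails independently with the same probability, which real chains only approximate; the function names are mine, not from any library.

```python
def chain_success(per_step_success: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return per_step_success ** steps


def chain_failure(per_step_success: float, steps: int) -> float:
    """Probability of at least one failure somewhere in the chain."""
    return 1.0 - chain_success(per_step_success, steps)


if __name__ == "__main__":
    # 100 steps at a 1% error rate each
    print(f"at least one failure: {chain_failure(0.99, 100):.1%}")
    # five agents chained at 95% reliability each
    print(f"end-to-end success:   {chain_success(0.95, 5):.1%}")
```

Correlated failures shift the exact numbers, but the direction holds: every added step multiplies in another factor below one.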

So the practical rule: keep the autonomous span short, the actions reversible, and the checks mechanical. The tasks I have watched succeed are first-pass triage, structured data extraction with schema validation, draft generation a human edits before sending, and advisory monitoring that reports an anomaly rather than fixing it. The tasks that fail are long-horizon research, ambiguous planning, anything negotiating with a real human counterparty, and anything irreversible in money or law. The supporting pieces on agentic AI use cases and shipping agentic AI examples go deeper, but the test above will save you most of the pain.

A founder I advised, Maya, wanted an agent to "run onboarding" for her SaaS: read each new signup's setup, configure their workspace, email them, and book a call. It demoed beautifully and broke in week two, when it auto-configured a customer's billing tier from a misread field and emailed them a wrong quote. We cut it down to one bounded, reversible job: draft the onboarding email for a human to send. That version has run for months without an incident. The whole task was an agent; the part that earned its keep was a sliver of it.

Before committing engineering time, I run a candidate task through five questions. Can I write the stop condition in one sentence? If a step goes wrong, can a person undo it in under five minutes? Can I check the output with code rather than a careful read? Does the agent need only tools whose worst case I can absorb? And is there a single decision in here I would never delegate, that I can carve out for a human? A task that clears all five is in the band. A task that fails two or more is a workflow, a single model call, or a project for later, not an agent.
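
Those five questions can be written down as a checklist you fill in per candidate task. The sketch below is one way to do that; the field names and the verdict strings are my own shorthand, and treating a single failed question as "out of the band" is my reading of the rule that removing any one property leaves the band.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class BandCheck:
    """One boolean per question; True means the task clears it."""
    stop_condition_in_one_sentence: bool
    undo_in_under_five_minutes: bool
    output_checkable_by_code: bool
    worst_case_tool_damage_absorbable: bool
    non_delegable_decision_carved_out: bool


def failed_questions(check: BandCheck) -> list[str]:
    """Names of the questions the task does not clear."""
    return [f.name for f in fields(check) if not getattr(check, f.name)]


def verdict(check: BandCheck) -> str:
    failed = failed_questions(check)
    if not failed:
        return "in the band"
    if len(failed) >= 2:
        return "workflow, single model call, or later"
    # Assumption: one failure still leaves the band until it is fixed.
    return f"out of the band: fix {failed[0]} first"


# Hypothetical scoring of a bounded drafting task.
draft_email = BandCheck(True, True, True, True, True)
print(verdict(draft_email))
```

The value is less in the code than in being forced to answer each question with a yes or a no before any engineering starts.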

Agentic design patterns that hold in production

A handful of patterns survive contact with production. They share one trait: each bounds the model's autonomy at the point where autonomy is dangerous.

Tool use with structured schemas. The model selects a tool, you validate the parameters, you execute, you feed the result back. The schema is the guardrail. A malformed call is caught before it touches anything.
Plan-then-execute with checkpoints. The model drafts a plan, a check (sometimes a human, sometimes a rule) approves it, then execution proceeds step by step. The plan is reviewable before any action lands.
Reflection tied to an eval, not a vibe. The model critiques its own output against a concrete check before proceeding. Reflection only helps when the critic has a real rubric. Self-review against "does this seem good?" mostly launders confidence.
Human-in-the-loop at named decisions. Not a human watching everything, which scales terribly. A human at the specific decisions you decided in advance you will never delegate.
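
To make the first pattern concrete, here is a minimal sketch of validating a model-proposed tool call before anything executes. It uses only the standard library; the registry, the `draft_reply` tool, and the type-per-parameter schema are hypothetical stand-ins for whatever schema format your stack uses.

```python
from typing import Any, Callable


class ToolCallRejected(Exception):
    """Raised when a proposed call does not match the tool's schema."""


def draft_reply(ticket_id: int, tone: str) -> str:
    # Hypothetical tool: produces a draft only, never sends anything.
    return f"draft for ticket {ticket_id} ({tone})"


# Hypothetical registry: tool name -> (required parameter types, implementation).
TOOLS: dict[str, tuple[dict[str, type], Callable[..., Any]]] = {
    "draft_reply": ({"ticket_id": int, "tone": str}, draft_reply),
}


def validate_call(name: str, args: dict[str, Any]) -> None:
    """Reject unknown tools, missing or extra parameters, and wrong types."""
    if name not in TOOLS:
        raise ToolCallRejected(f"unknown tool: {name}")
    schema, _ = TOOLS[name]
    missing = schema.keys() - args.keys()
    extra = args.keys() - schema.keys()
    if missing or extra:
        raise ToolCallRejected(f"missing={sorted(missing)} extra={sorted(extra)}")
    for param, expected in schema.items():
        # Exact type match, so a bool is not silently accepted as an int.
        if type(args[param]) is not expected:
            raise ToolCallRejected(f"{param} must be {expected.__name__}")


def execute(name: str, args: dict[str, Any]) -> Any:
    """Validate first; the tool only runs if the call is well-formed."""
    validate_call(name, args)
    _, impl = TOOLS[name]
    return impl(**args)
```

A rejected call can be fed back to the model as an error message, so it gets a chance to correct itself without the malformed call ever reaching a real system.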

That last one is the discipline most teams skip. Putting a human in the loop everywhere turns the reviewer into a bottleneck, then a rubber stamp, then a liability. I argue the full version of that trap in my book Human in the Loop Is Not a Plan. The fix is to name the decisions that always require judgment, and let the agent own only what falls outside that list. The supporting catalog of agentic design patterns carries every variant; the patterns above are the load-bearing ones.

Two anti-patterns deserve a warning, because they look sophisticated and fail expensively. The first is the open-ended multi-agent swarm: a planner spawning sub-agents that spawn more sub-agents. It demos beautifully and collapses under the compounding math above, because every hop multiplies another reliability discount onto the result. The second is reflection without a ground truth. An agent asked to grade its own work against no external check tends to approve itself, since the same model that produced the error rarely catches it. Reflection earns its place only when the critic holds a test, a schema, or a rule the generator does not control.
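
Here is what reflection with a ground truth can look like in miniature. In this sketch the critic is a plain function the generator cannot influence; the required keys and the `generate` callable are hypothetical, and a real system would pass the feedback into the next model call.

```python
import json
from typing import Callable


def external_check(output: str) -> list[str]:
    """A rule the generator does not control: valid JSON with required keys."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(data, dict):
        return ["output is not a JSON object"]
    # Hypothetical required fields for a triage result.
    return [f"missing key: {key}" for key in ("category", "needs_human") if key not in data]


def generate_with_check(generate: Callable[[list[str]], str], max_attempts: int = 3) -> str:
    """Retry generation, feeding back concrete failures, up to a hard cap."""
    feedback: list[str] = []
    for _ in range(max_attempts):
        candidate = generate(feedback)
        feedback = external_check(candidate)
        if not feedback:
            return candidate
    raise RuntimeError(f"no candidate passed the external check: {feedback}")
```

The retry cap matters as much as the check: a critic that can fail the output forever is just another loop that never stops.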

The patterns that hold share a shape worth naming: they make the model's choices observable and reversible at the moment of action. Tool schemas catch a bad call before execution. Plan checkpoints catch a bad plan before any step lands. Named human decisions catch the consequences no system should own. You are not trying to make the agent never err. You are trying to make sure that when it errs, the error is cheap, visible, and recoverable.

How to build an AI agent: spec, tools, guardrails, evals

The build loop that holds has four stages in order: write the spec, give it tools, wrap it in guardrails, then evaluate before you widen its scope. Skipping any stage is how a demo dies in production.

Spec first. Write down the goal, the allowed actions, the stop condition, and the decisions the agent may never make alone. The spec is the program here, a point I argue in my AI-Native thesis: the machine does the job and the human evaluates. If you cannot specify the stop condition, you are not ready to build.
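
A spec like that can live in code as well as in a document. The sketch below is one possible shape, not a standard; the class, its field names, and the example actions are all illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentSpec:
    goal: str
    allowed_actions: frozenset[str]
    stop_condition: str
    never_alone: frozenset[str]  # decisions that always go to a human

    def __post_init__(self) -> None:
        if not self.stop_condition.strip():
            raise ValueError("no stop condition: not ready to build")
        overlap = self.allowed_actions & self.never_alone
        if overlap:
            raise ValueError(f"both allowed and forbidden: {sorted(overlap)}")

    def disposition(self, action: str) -> str:
        """Decide what happens to an action the model proposes."""
        if action in self.never_alone:
            return "escalate"
        if action in self.allowed_actions:
            return "allow"
        return "deny"  # anything the spec does not name is refused


# Hypothetical spec for a ticket-triage agent.
triage = AgentSpec(
    goal="Triage open tickets and flag refund cases for review",
    allowed_actions=frozenset({"search_kb", "classify", "check_refund", "draft_reply"}),
    stop_condition="Every ticket in the batch has a category and a draft or a flag",
    never_alone=frozenset({"decide_refund"}),
)
```

Denying by default is the design choice that counts: an action the spec never mentioned is treated as out of scope rather than as permitted.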

Tools, scoped tight. Give the agent exactly what the spec requires. Read replica, not primary. Draft endpoint, not send. Every tool you add expands the blast radius, so add deliberately.

Guardrails on every action. Validate tool inputs against schemas. Cap the loop length. Add a kill switch. Check for prompt injection on anything the agent reads from the outside, because untrusted content in an agent's context is an attack surface, not just data.
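
Two of those guardrails, the loop cap and the kill switch, are simple enough to sketch directly. The `next_action` callable stands in for the model deciding its next step and is an assumption of this example; input validation and prompt-injection screening are deliberately left out.

```python
import threading
from typing import Callable, Optional


def run_loop(
    next_action: Callable[[list[str]], Optional[str]],
    max_steps: int,
    kill_switch: threading.Event,
) -> tuple[str, list[str]]:
    """Run the agent loop until it stops itself, is killed, or hits the cap."""
    history: list[str] = []
    for _ in range(max_steps):
        if kill_switch.is_set():
            return "killed", history  # an operator pulled the plug
        action = next_action(history)
        if action is None:
            return "stopped", history  # the agent reached its stop condition
        history.append(action)
    return "step_cap_reached", history  # the cap ended the run, not the agent
```

Returning a distinct status for each exit matters for evals: a run that ended on the step cap should be graded as a failure to stop, not as a normal completion.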

Evals before scope. Build the eval harness before you extend autonomy, not after the first incident. Then a run produces a trace you can grade.

# A graded agent run, scored on the trajectory, not just the answer
RUN ticket-triage agent=v3 model=mid-tier steps=6
step 1 tool=search_kb ok latency=420ms
step 2 tool=classify ok conf=0.91
step 3 tool=check_refund ok matched=true
step 4 action=flag_for_human ok # correct: never auto-decide refunds
step 5 tool=draft_reply ok latency=1180ms
step 6 action=stop ok
EVAL trajectory="pass" cost=$0.011 p95_latency=2.3s human_escalation=correct

Notice what the eval grades: the path, the cost, the latency, and whether the agent escalated the one decision it was told never to make alone. A final answer that looks right with a broken trajectory is a regression waiting to happen. The supporting walkthroughs on how to build AI agents and agentic coding go line by line; this is the shape of the loop.
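
A trajectory grader for a run like that one can be small. This sketch assumes the trace has already been parsed into structured steps; the `Step` record, the check names, and the budget thresholds are hypothetical, not the output format of any particular eval tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    kind: str  # "tool" or "action"
    name: str
    ok: bool


def grade(
    steps: list[Step],
    cost_usd: float,
    p95_latency_s: float,
    *,
    must_escalate: bool,
    max_cost_usd: float,
    max_latency_s: float,
) -> dict[str, bool]:
    """Grade the path, the budgets, and the required escalation."""
    names = [s.name for s in steps]
    checks = {
        "all_steps_ok": all(s.ok for s in steps),
        "escalated_when_required": (not must_escalate) or "flag_for_human" in names,
        "ended_with_stop": bool(steps) and steps[-1].name == "stop",
        "within_cost": cost_usd <= max_cost_usd,
        "within_latency": p95_latency_s <= max_latency_s,
    }
    checks["pass"] = all(checks.values())
    return checks


# The six steps from the run above, re-entered by hand.
run = [
    Step("tool", "search_kb", True),
    Step("tool", "classify", True),
    Step("tool", "check_refund", True),
    Step("action", "flag_for_human", True),
    Step("tool", "draft_reply", True),
    Step("action", "stop", True),
]
# Budgets of $0.05 and 5 seconds are illustrative assumptions.
result = grade(run, 0.011, 2.3, must_escalate=True, max_cost_usd=0.05, max_latency_s=5.0)
```

Dropping the `flag_for_human` step from `run` flips `escalated_when_required` to false even though every remaining step reports ok, which is exactly the kind of failure a final-answer check would miss.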

Frameworks and tooling: a neutral survey

The framework you pick matters less than the discipline you bring. That said, the 2026 landscape has settled into a few credible choices, each with a different center of gravity.

LangGraph leans into graph-based control flow, persistence, and audit trails. It maps cleanly to production needs like rollback points, which is why it leads on enterprise adoption per the open-source framework surveys.
CrewAI centers multi-agent collaboration and has broad protocol support. Reasonable when several specialized agents genuinely need to coordinate, though remember the compounding-reliability tax on every added hop.
OpenAI Agents SDK optimizes for the fastest path to something running and works across many models. Good for a first prototype.

My honest take: most teams reach for a multi-agent framework when a single well-scoped agent, or even a plain workflow, would be more reliable and far cheaper. Gartner projects 40% of enterprise applications will feature task-specific agents by the end of 2026, up from under 5% in 2025. A lot of those will be frameworks solving problems the team did not have. Pick the simplest tool that meets the spec, and add complexity only when an eval shows you must. The supporting agentic AI frameworks comparison weighs each on what it costs you in practice, and the rundown of the best AI agents shipping today shows which choices actually hold.

Memory, retrieval, and context for agents

Memory is the part teams treat as an afterthought and then get burned by. An agent that runs across sessions needs an explicit memory architecture, or it forgets what it cannot fit in context and invents what it cannot recall.

Split memory into three kinds. Working memory is the live context window, handled by prompt design. Episodic memory is the structured log of past runs: what happened, when, with what result. Semantic memory is durable facts and preferences that should persist. Working memory needs no new infrastructure. The other two are infrastructure decisions about where data lives, how it is indexed, and how stale entries get retired. I break down each layer in the supporting guide to memory systems for agents, and my book Memory Systems for Agents is the most rigorous treatment of this I have found.
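
As a rough illustration of that three-way split, the sketch below keeps each kind of memory in its own structure and shows one retirement rule for stale facts. Everything here is in-process and hypothetical; in production the episodic and semantic stores would be real databases with their own indexing.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Episode:
    run_id: str
    finished_at: float  # Unix timestamp
    outcome: str


@dataclass
class Fact:
    value: str
    confirmed_at: float  # Unix timestamp of the last confirmation


@dataclass
class AgentMemory:
    working: list[str] = field(default_factory=list)        # goes into the context window
    episodic: list[Episode] = field(default_factory=list)   # log of past runs
    semantic: dict[str, Fact] = field(default_factory=dict) # durable facts and preferences

    def retire_stale_facts(self, now: float, max_age_s: float) -> list[str]:
        """Drop facts not confirmed within max_age_s; return the retired keys."""
        stale = [k for k, f in self.semantic.items() if now - f.confirmed_at > max_age_s]
        for key in stale:
            del self.semantic[key]
        return stale
```

Age-based retirement is only one possible policy. The point is that some explicit rule exists, so a fact that was true a year ago does not quietly steer today's run.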

Retrieval is where agentic systems meet RAG, and where they inherit RAG's failure modes. The demo retrieves perfectly. Then the corpus grows, the queries drift, and recall quietly collapses over the following months. Agentic RAG, where the agent decides what to fetch, can beat static retrieval on hard queries, but it adds latency and another failure surface. Use it when the query genuinely needs iterative search, not by default. If your knowledge layer is the bottleneck, that is exactly the work Devlyn does on RAG and knowledge integration.

Evaluating agents: "a human reviews it" is not a plan

You evaluate an agentic workflow by grading its trajectory against an eval harness, not by reading the final answer and nodding. Multi-step, tool-using systems fail in the middle, where a quick glance never looks. The harness is the only thing that scales with autonomy.

A real agent eval checks several layers. Did each tool call succeed with valid parameters? Did the agent escalate the decisions it was required to escalate? Did it stay inside cost and latency budgets? Did the trajectory match an acceptable path, even when the final output looked fine? This is the same discipline I argue for in building evals that predict production, applied to a moving target.

Vibes are not evals. If you cannot write a check for what the agent did, you are not evaluating it. You are hoping.

The reliability numbers above exist because most teams ship without this harness and discover the failure modes from customers. A multi-dimensional study of enterprise agentic systems measured a 37% gap between lab benchmark scores and real deployment performance, and a 50x cost spread across agents hitting similar accuracy. An eval harness is how you find out which side of that gap you are on before you scale. If you would rather have observability and evals built in from day one than bolted on after an incident, that is the work Devlyn does on AI observability and monitoring.

What agents cost: the latency budget and the revenue lens

Every agentic workflow trades latency and cost for autonomy. Anthropic says this plainly in its agents guide, and it is the trade most teams price wrong. A loop that calls a model six times costs at least six times a single call, and usually more because each call re-sends a growing context. The latency stacks on top. Each step is a chance to be slow and a chance to be wrong.
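
A back-of-envelope cost model makes the trade easier to price. This sketch assumes per-token pricing and that each call re-sends the context accumulated so far; the token counts and prices in the example are made up for illustration and do not reflect any vendor's rates.

```python
def loop_cost(
    calls: int,
    base_input_tokens: int,
    tokens_added_per_step: int,
    output_tokens: int,
    usd_per_input_token: float,
    usd_per_output_token: float,
) -> float:
    """Total cost of a loop whose context grows by a fixed amount each step."""
    total = 0.0
    for i in range(calls):
        # Call i carries the base prompt plus everything added by earlier steps.
        input_tokens = base_input_tokens + i * tokens_added_per_step
        total += input_tokens * usd_per_input_token + output_tokens * usd_per_output_token
    return total


# Illustrative numbers only.
single = loop_cost(1, 1000, 500, 200, 1e-6, 4e-6)
six_step = loop_cost(6, 1000, 500, 200, 1e-6, 4e-6)
print(f"six-step loop costs {six_step / single:.1f}x a single call")
```

With a growing context the multiple comes out above six, which is why a loop that looks cheap per call can still blow the budget per task.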

From the revenue seat, this changes the calculus. An agent that resolves a ticket in 40 seconds and $0.04 may beat a human on cost. The same agent that takes four minutes and $0.40 because it looped on a hard case may lose money and a customer. The right question is never "can an agent do this?" It is "can an agent do this inside the latency and cost budget the business can afford?"

This is also why the biggest model is rarely the right one. Revenue rewards the model you can afford to run, ship, and explain, not the one that tops a benchmark. Route the easy steps to a cheap model and reserve the expensive one for the steps that need it. If you want a team that ships agentic workflows with this cost discipline built in, that is what Devlyn's engineers do, and it is the core argument of my book Agents That Actually Work.
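
Routing by step type can start as a lookup table. The sketch below assumes a flat per-call price and a hand-picked set of hard steps; the model names, prices, and step labels are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    usd_per_call: float  # hypothetical flat price, for illustration only


CHEAP = Model("small-model", 0.002)
EXPENSIVE = Model("large-model", 0.03)

# Hypothetical: the step types judged to need the stronger model.
HARD_STEPS = frozenset({"plan", "resolve_ambiguity"})


def route(step_type: str) -> Model:
    """Send hard steps to the expensive model and everything else to the cheap one."""
    return EXPENSIVE if step_type in HARD_STEPS else CHEAP


def run_cost(step_types: list[str]) -> float:
    return sum(route(s).usd_per_call for s in step_types)


steps = ["classify", "search_kb", "plan", "draft_reply"]
routed = run_cost(steps)
all_expensive = EXPENSIVE.usd_per_call * len(steps)
print(f"routed: ${routed:.3f}  all-expensive: ${all_expensive:.3f}")
```

Which steps count as hard is an empirical question. The eval harness should decide it, by showing where the cheap model's trajectories start to fail.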

Frequently asked questions

What is an AI agent?

An AI agent is software that pursues a goal by taking actions in a loop, using a language model to decide each next step from what it observes. It calls tools, reads results, adjusts, and repeats until a stop condition. The defining trait is that the model chooses its actions at runtime, rather than following a fixed path a developer wrote.

What is the difference between agentic AI and generative AI?

Generative AI produces an artifact you read and judge, like text or an image. Agentic AI takes the action itself, like sending the email or updating the record. The shift from producing output to executing action is what introduces risk: generation is reversible because nothing happens until a human acts, while agentic workflows remove that checkpoint.

Are AI agents reliable?

In a narrow band, yes. Broadly, not yet. Reliability compounds downward, so a small per-step error rate becomes a large end-to-end failure rate over many steps. Production data shows real-world success rates around 57% across diverse agents, well below demo performance. Agents are reliable on bounded, reversible, verifiable, tool-scoped tasks and unreliable outside that band.

When should you not use an AI agent?

Avoid an agent when the task is unbounded, the actions are irreversible, the output cannot be checked mechanically, or the tools are too broad to contain. Skip it too when a single model call or a fixed workflow would do the job more cheaply and reliably. Irreversible financial or legal actions should never run autonomously.

How do you evaluate an AI agent?

Grade the trajectory, not just the final answer, with an eval harness. Check whether each tool call succeeded with valid inputs, whether the agent escalated the decisions it was required to escalate, and whether it stayed inside cost and latency budgets. Build this harness before you widen the agent's autonomy, not after the first incident.

What is the best framework to build agents?

There is no single best framework; pick the simplest one that meets your spec. LangGraph suits complex, auditable control flow; CrewAI suits genuine multi-agent collaboration; the OpenAI Agents SDK suits the fastest prototype. Most teams overreach for multi-agent frameworks when one well-scoped agent, or a plain workflow, would be more reliable and cheaper.

Where this leaves you

Agentic workflows are real and valuable in the narrow band where tasks are bounded, reversible, verifiable, and tool-scoped. Outside it, autonomy compounds risk faster than it adds value. Build the spec, scope the tools, wrap the guardrails, and grade the trajectory before you widen the loop. That sequence is the difference between a system that holds and a demo that embarrasses you.

If you are deciding whether a process should be an agent at all, the deeper version of this argument lives in my book Agents That Actually Work, which goes through the narrow-band framework with production examples. And if you want a team that ships agentic workflows with evals and cost discipline built in from day one, hire AI engineers who have done it in production. The honest path is the faster one. Build for the narrow band, prove it with evals, and extend only when the numbers say you can.