An honest accounting of what agents can do today

Between the demos and the disappointment lies a narrow band of tasks where agents genuinely earn their keep.

There is a version of this essay I could write that would make everyone feel good. I could tell you that autonomous agents are about to transform every knowledge-work function, that your competitors are deploying fleets of them right now, that the window to act is closing. People write that essay every week on LinkedIn. I am going to write the other one.

I have spent the last two years deploying production AI systems, not in controlled demos, not in side projects, but in real operational contexts where failures land on real people. At Devlyn we run AI-assisted workflows across customer service, operations, and product. Before that, as CTO and COO of earlier companies, I was responsible for the systems that kept things running. What I know about agents comes from the full arc: the promising prototype, the embarrassing incident, the slow grind of making something reliable enough to trust with real work.

The honest truth about agents in mid-2026 is that they work in a narrow band. That band is real and valuable; I am not writing a pessimist's screed. But it is narrower than the demos suggest, and understanding its contours is the difference between deploying something useful and deploying something that silently fails in ways you will not catch until the damage is done.

What "agent" actually means in production

The word agent has become almost meaningless from overuse. Let me define what I mean. An agent is a system where a language model is given a goal, a set of tools, and the ability to decide what to do next based on what it sees. It is not a single prompt-response pair. It loops. It calls tools, observes results, updates its plan, calls more tools. It can run for minutes or hours. It might send emails, write files, call APIs, escalate to humans.
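To make that definition concrete, here is a minimal sketch of the loop in Python. Everything in it is an assumption for illustration: `decide` is a hypothetical stand-in for the model call, `tools` is a plain dictionary of callables, and `max_steps` is an arbitrary step budget. It is the shape of the thing, not a production implementation.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Step:
    tool: str      # name of the tool to call, or "finish" to stop
    argument: str  # a single string argument keeps the sketch simple


@dataclass
class AgentRun:
    goal: str
    observations: list = field(default_factory=list)


def run_agent(goal: str,
              decide: Callable[[AgentRun], Step],
              tools: dict,
              max_steps: int = 10) -> AgentRun:
    """Loop: decide, call a tool, observe, repeat until finish or budget."""
    run = AgentRun(goal)
    for _ in range(max_steps):
        step = decide(run)  # hypothetical: the model call would live here
        if step.tool == "finish":
            return run
        if step.tool not in tools:
            # Surface the problem as an observation instead of crashing.
            run.observations.append(f"error: unknown tool {step.tool!r}")
            continue
        run.observations.append(tools[step.tool](step.argument))
    # A hard step budget is one crude guard against looping forever.
    run.observations.append("stopped: step budget exhausted")
    return run
```

The point of the sketch is the control flow: the model, not your code, picks the next action on every pass through the loop.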

That multi-step autonomy is both the point and the problem. The moment you let a model decide its own next action, you have introduced a control surface that behaves differently from anything you have shipped before. Traditional software fails in ways that are mostly predictable: a missing field, an unhandled exception, a timeout. Agents fail in ways that are surprising. They confidently pursue the wrong goal. They get stuck in loops. They hallucinate tool outputs and proceed as if the hallucination were real. They take an action that cannot be undone and then, when they realize it was wrong, report success anyway because they have no mechanism to surface the error honestly.

I am not saying this to scare you off agents. I am saying it because every deployment decision downstream of this depends on understanding the failure mode. The narrow band of tasks where agents genuinely earn their keep is defined almost entirely by how well you can contain and recover from that failure mode.

The narrow band: what actually works

The tasks where I have seen agents deliver consistent, trustworthy value share four characteristics. They are bounded, reversible, verifiable, and tool-scoped.

Bounded means the task has a clear starting state and a clear ending state. "Handle our support queue" is not bounded; it is open-ended and perpetual. "Triage this batch of 200 support tickets and draft a prioritized response queue, flagging anything that matches our refund policy for human review" is bounded. The agent runs, produces output, and stops. You look at the output. You decide what to do with it.

Reversible means that if the agent made a mistake, you can undo it without catastrophic downstream consequences. Drafting is reversible. Sending is not. Updating a record in a staging environment is reversible. Posting to a production database that triggers billing is not. I have a rule of thumb that I apply ruthlessly: if an agent action cannot be undone in under five minutes by a senior engineer without external dependencies, it is not a candidate for autonomous execution. It either needs explicit human approval, or it should not be in the agent's tool scope at all.
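One way to encode that rule of thumb is to tag every action with whether it is reversible and refuse to run the irreversible ones without an approver. The sketch below assumes hypothetical names throughout (`Action`, `execute`, `approver`); the `reversible` flag is something a human sets when the tool is registered, not something the agent decides.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Action:
    name: str
    reversible: bool  # True only if it can be undone quickly, with no external dependencies


def execute(action: Action,
            perform: Callable[[], None],
            approver: Optional[Callable[[Action], bool]] = None) -> str:
    """Run reversible actions directly; gate irreversible ones on a human."""
    if action.reversible:
        perform()
        return "executed"
    # Irreversible: only proceed if a human approver exists and says yes.
    if approver is not None and approver(action):
        perform()
        return "executed_with_approval"
    return "blocked"
```

With no approver wired in, an irreversible action is simply blocked, which matches the stricter half of the rule: if nobody can approve it, it should not be in the agent's reach.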

Verifiable means you can check the output mechanically, not just by reading it and thinking it seems fine. For a code generation agent, the test suite runs and either passes or fails; that is verification. For a data extraction agent, the output schema is validated and the records cross-checked against known totals. For a document summarization agent, I can do spot checks and compare against human-generated summaries. "Seems good" is not verification. Vibes are not evals. If you cannot write a check, you are flying blind.
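For the data extraction case, a mechanical check can be as small as the sketch below: validate each record against a schema, then cross-check the batch against a known total. The schema (`invoice_id`, `amount`) and the function names are assumptions made up for the example.

```python
from decimal import Decimal

# Hypothetical schema: field name -> required Python type.
REQUIRED_FIELDS = {"invoice_id": str, "amount": Decimal}


def schema_errors(record: dict) -> list:
    """Return a list of schema problems for one extracted record."""
    errors = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in record:
            errors.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            errors.append(f"wrong type for {name}: {type(record[name]).__name__}")
    return errors


def verify_batch(records: list, expected_total: Decimal) -> list:
    """Validate every record, then cross-check the sum against a known total."""
    problems = []
    for index, record in enumerate(records):
        for error in schema_errors(record):
            problems.append(f"record {index}: {error}")
    if not problems:
        # Only compare totals once every record is well-formed.
        total = sum((r["amount"] for r in records), Decimal("0"))
        if total != expected_total:
            problems.append(f"total mismatch: extracted {total}, expected {expected_total}")
    return problems
```

An empty list means the batch passed; anything else goes to a human or a fallback path rather than downstream.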

Tool-scoped means the agent has access to exactly the tools it needs and nothing more. This is the principle of least privilege applied to AI systems. An agent that summarizes documents should not have write access to the document store. An agent that drafts email responses should not have a send key. An agent that queries a database for analytics should be connected to a read replica, not the primary. This is not just about security, though it is also about security. It is about limiting blast radius. When the agent does something unexpected, and it will, you want the set of possible consequences to be small and reversible. My book, "Agents That Actually Work: The narrow band where autonomy earns its keep," goes deep on this framework, with production examples and the failure modes that shaped each principle.
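A minimal sketch of least-privilege tool scoping, assuming a hypothetical `ScopedTools` wrapper: each agent gets a view that contains only its allowed tools, so an out-of-scope call fails loudly instead of quietly succeeding.

```python
class ToolNotPermitted(Exception):
    """Raised when an agent asks for a tool outside its scope."""


class ScopedTools:
    def __init__(self, available: dict, allowed: set):
        unknown = set(allowed) - set(available)
        if unknown:
            raise ValueError(f"scope names unknown tools: {sorted(unknown)}")
        # Keep only the allowed tools, so out-of-scope ones are unreachable.
        self._tools = {name: available[name] for name in allowed}

    def call(self, name: str, *args):
        if name not in self._tools:
            raise ToolNotPermitted(name)
        return self._tools[name](*args)
```

In a real deployment the stronger version of this lives in credentials and network access (a read replica, a key without send permission), not only in application code; the wrapper is a second line of defense.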

The tasks where agents earn their keep are bounded, reversible, verifiable, and tool-scoped. Remove any one of those four properties and you are no longer in the narrow band. You are in the territory where agents fail quietly and you find out later.

Here are concrete examples from my own deployments. Data extraction from unstructured documents (PDFs, emails, scanned invoices) works well when you validate output against a schema and have a fallback path for low-confidence extractions. First-pass triage of support tickets against a known taxonomy works well when humans review the triage output before it triggers any downstream action. Generating draft responses to common inquiry patterns works well when a human reads and edits before sending. Automated monitoring that surfaces anomalies and writes a structured incident report works well when it is advisory, not responsive: it tells you something might be wrong, but it does not try to fix it.

What does not work well, in my experience: long-horizon research tasks where the agent must maintain a coherent goal across many steps and many hours. Planning tasks where the goal itself is ambiguous and the agent must clarify requirements before proceeding. Anything where the agent needs to negotiate with external parties (customers, vendors, partners), because the relational and reputational stakes are too high and the agent has no real sense of them. And anything involving irreversible financial or legal actions, full stop, no exceptions.

Memory: the part everyone gets wrong

The most underrated problem in agent deployment is memory. When a language model handles a single conversation, memory is not an issue because the entire conversation is in context. When an agent runs across multiple sessions, or when you are operating multiple agents that need to share state, or when an agent needs to remember something from a run three days ago to make a good decision today, you need an explicit memory architecture. And most teams treat this as an afterthought.

I have seen agents fail badly because of memory design errors. An agent that is supposed to escalate issues that have been open for more than 48 hours does not escalate them because it has no reliable way to know when an issue was first opened: the timestamp is stored in a format it interprets inconsistently. An agent that is supposed to avoid sending a follow-up message to a customer who already received one today cannot check that reliably because the sent-message log is in a different system and the agent's tool for querying it does not handle timezone boundaries correctly. These are not exotic edge cases. They are the kinds of things that fail in the first week of production use.
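Both failures come down to the tool, not the model, owning time arithmetic. The sketch below uses only the Python standard library; the function names and the choice to reject timestamps without an explicit UTC offset are my assumptions for illustration, not a description of any particular system.

```python
from datetime import datetime, timedelta, timezone, tzinfo


def parse_utc(timestamp: str) -> datetime:
    """Parse an ISO 8601 timestamp; refuse values without an explicit offset."""
    parsed = datetime.fromisoformat(timestamp)
    if parsed.tzinfo is None:
        # A naive timestamp is ambiguous, so fail loudly instead of guessing.
        raise ValueError(f"timestamp has no UTC offset: {timestamp!r}")
    return parsed.astimezone(timezone.utc)


def needs_escalation(opened_at: str, now: datetime,
                     limit: timedelta = timedelta(hours=48)) -> bool:
    """True if the issue has been open longer than the limit."""
    return now - parse_utc(opened_at) > limit


def already_messaged_today(sent_at: str, now: datetime, customer_tz: tzinfo) -> bool:
    """Compare calendar days in the customer's timezone, not in UTC."""
    sent_local = parse_utc(sent_at).astimezone(customer_tz)
    return sent_local.date() == now.astimezone(customer_tz).date()
```

Here `now` must itself be timezone-aware. The agent then asks a yes-or-no question of the tool and never has to interpret a raw timestamp string.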

Good memory architecture for agents involves three things: a clear distinction between working memory (in-context, ephemeral), episodic memory (structured logs of what happened in past runs), and semantic memory (facts and preferences that should persist). Working memory is handled by prompt design. Episodic and semantic memory require deliberate infrastructure choices: where the data lives, how it is indexed, how the agent retrieves it, and how stale entries are managed. Memory Systems for Agents is the most rigorous treatment of this I have found. It changed how I think about agent persistence and is required reading for anyone building multi-session workflows.
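As a rough illustration of the three-way split, the sketch below keeps episodic and semantic memory in plain in-process structures and assembles working memory from them on demand. The class and method names are hypothetical, and a real system would back this with a database; the part worth copying is that stale facts are reported as unknown rather than returned as if they were current.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Fact:
    value: str
    recorded_at: datetime


@dataclass
class AgentMemory:
    episodes: list = field(default_factory=list)  # episodic: what happened in past runs
    facts: dict = field(default_factory=dict)     # semantic: facts that should persist

    def log_episode(self, run_id: str, summary: str, at: datetime) -> None:
        self.episodes.append({"run_id": run_id, "summary": summary, "at": at})

    def remember(self, key: str, value: str, at: datetime) -> None:
        self.facts[key] = Fact(value, at)

    def recall(self, key: str, now: datetime, max_age: timedelta) -> Optional[str]:
        """Return the fact, or None if it is missing or older than max_age."""
        fact = self.facts.get(key)
        if fact is None or now - fact.recorded_at > max_age:
            return None
        return fact.value

    def working_context(self, keys: list, now: datetime,
                        max_age: timedelta, last_n: int = 3) -> str:
        """Build the ephemeral, in-context view for a single run."""
        lines = []
        for key in keys:
            value = self.recall(key, now, max_age)
            shown = value if value is not None else "UNKNOWN (missing or stale)"
            lines.append(f"{key}: {shown}")
        for episode in self.episodes[-last_n:]:
            lines.append(f"past run {episode['run_id']}: {episode['summary']}")
        return "\n".join(lines)
```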

The practical implication: before you deploy any agent that needs to reason about its own history, audit every place it needs to recall something. Ask: where does that information live? How does the agent access it? What happens if the access fails? What happens if the information is stale? If you cannot answer those questions cleanly, you are not ready to ship.

Observability: knowing why the agent did that

Here is a question I ask every team that tells me they are deploying agents in production: when the agent does something unexpected, how long does it take you to figure out why?

For most teams, the honest answer is "too long." They have logs of what the agent did: tool calls, outputs, final results. They may not have a clean trace of the reasoning that led from input to decision. They cannot answer "the agent sent the wrong escalation; what did it see that made it think escalation was appropriate?" without significant forensic work.

This matters more than most people think, and not just for debugging incidents after the fact. If you cannot inspect an agent's reasoning process during a run, you cannot intervene intelligently. You are reduced to watching the outputs and hoping they are acceptable, which is not a posture you can maintain in a production system for anything consequential.

What I consider the minimum viable observability stack for an agent in production: structured traces of every reasoning step and tool call, with inputs and outputs; latency and cost attribution by step; a mechanism to tag runs as "interesting" for later review; dashboards that surface behavioral drift over time (not just whether the agent succeeded, but whether its approach is changing in ways that might indicate model updates or data distribution shifts upstream); and human review queues that sample a percentage of runs for spot inspection regardless of outcome. The discipline I have found most useful here treats observability for AI systems as a first-class concern, covering tracing, evaluation loops, and the org practices that make observability useful rather than theatrical.
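A minimal sketch of the first few items on that list, using only the standard library. The record fields, the `Tracer` class, and the hash-based sampling rule are illustrative assumptions; in practice you would likely emit these events to whatever tracing backend you already run.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass
class TraceEvent:
    run_id: str
    step: int
    kind: str          # "reasoning" or "tool_call"
    input: str
    output: str
    latency_ms: float
    cost_usd: float
    tags: list = field(default_factory=list)


class Tracer:
    def __init__(self, run_id: str):
        self.run_id = run_id
        self.events = []

    def record(self, kind, input, output, latency_ms, cost_usd, tags=()):
        """Append one structured event; step numbers follow call order."""
        event = TraceEvent(self.run_id, len(self.events), kind, input,
                           output, latency_ms, cost_usd, list(tags))
        self.events.append(event)
        return event

    def cost_by_kind(self) -> dict:
        """Attribute cost to reasoning steps versus tool calls."""
        totals = {}
        for event in self.events:
            totals[event.kind] = totals.get(event.kind, 0.0) + event.cost_usd
        return totals

    def to_jsonl(self) -> str:
        """One JSON object per line, ready to ship to a log store."""
        return "\n".join(json.dumps(asdict(e), sort_keys=True) for e in self.events)


def sampled_for_review(run_id: str, rate: float) -> bool:
    """Deterministically pick roughly `rate` of runs for human spot checks."""
    bucket = int(hashlib.sha256(run_id.encode("utf-8")).hexdigest(), 16) % 10_000
    return bucket < rate * 10_000
```

Hashing the run ID rather than rolling a random number means the same run is always in or out of the review sample, which makes the queue reproducible.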

If you cannot answer "why did the agent do that?" within ten minutes of being asked, you do not have production-grade observability. You have logging. Those are different things.

We do post-mortems on significant agent incidents the same way we do post-mortems on infrastructure incidents. Not blame-focused, but causally rigorous. What did the agent see? What decision did it make? Was the decision reasonable given what it saw, or is there a reasoning error we need to address? Is the issue in the model, in the tool, in the prompt, in the memory system, or in our evaluation rubric? You cannot run this process without observability infrastructure. You end up guessing, and guessing is how you end up with the same incident again next month.

Human escalation as first-class design

There is a version of agent design that treats human escalation as a failure mode: something that happens when the agent cannot handle a case, a regrettable exception. This is wrong. Human escalation is a feature, and in my experience the teams that build it in as a first-class design element get to reliable production systems faster than teams that treat it as an edge case.

At Devlyn, we have a principle we call "senior owns production." It means that regardless of how much autonomy we extend to any automated system, a senior person is always in the loop for anything that could affect a customer relationship or a material business outcome. The agent is not the decision-maker for those cases. The agent is the triage layer: it handles what it can handle well, and it routes everything else to a human with enough context for that human to act quickly.

This design requires two things that teams often skip. First, the agent needs to know when to escalate: not only when it encounters an error, but whenever its confidence in its own reasoning falls below a threshold. This requires the agent to have calibrated self-assessment, which is a real capability that you have to build and test explicitly. You cannot assume it. Second, the escalation path needs to actually work. There has to be a human available to receive the escalation, a mechanism to get them the right context, and a response time expectation that the agent can plan around. If the escalation queue backs up, the agent is stuck or makes a decision it should not have made. The human handoff is part of the system, and it needs to be designed and monitored like any other part of the system.
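Both requirements can be sketched as a single routing function. Everything here is an assumption for illustration: the `confidence` score is taken to be already calibrated (which, as noted above, you have to earn), and the threshold and queue limit are placeholder numbers, not recommendations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    action: str
    confidence: float      # assumed to be a calibrated score between 0 and 1
    customer_facing: bool  # could it affect a customer relationship?


def route(decision: Decision,
          threshold: float = 0.8,
          queue_depth: int = 0,
          max_queue: int = 20) -> str:
    """Return "proceed", "escalate", or "hold" for a proposed action."""
    needs_human = decision.customer_facing or decision.confidence < threshold
    if not needs_human:
        return "proceed"
    if queue_depth >= max_queue:
        # The queue is backed up: wait rather than let the agent decide alone.
        return "hold"
    return "escalate"
```

The "hold" branch is the part teams tend to forget. It makes a backed-up queue visible as a state you can monitor and alert on, instead of a silent pressure on the agent to act anyway.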

I have zero tolerance for "we will figure out human escalation later." Later means after something goes wrong. That is a choice, and it is a bad one.

No fantasy timelines

I want to close with something that feels almost embarrassingly basic, and yet I keep seeing it violated in practice. Do not make commitments about agent capabilities based on demo performance. Demos are curated. Production is not curated. A demo runs a task in the most favorable conditions the presenter can engineer. Production runs the task when the input is slightly malformed, when the external API is slow, when the context has grown large enough to degrade reasoning quality, when the model update last week changed a behavior you depended on without anyone announcing it.

The gap between demo performance and production reliability for agents is the largest I have seen for any category of software in twenty years. It is not because the technology is fraudulent. It is because agents are genuinely sensitive to distribution shifts in a way that traditional software is not, and because the evaluation infrastructure needed to catch regressions before they reach production is still immature and under-invested in most organizations.

The teams I respect most are the ones who build boring-looking agents that do narrow things well, invest heavily in evaluation and observability, extend the scope of autonomy only when the evals support it, and refuse to pretend otherwise. They are not the teams with the best demo reel. They are the teams with the best track record of things not breaking in front of customers.

The narrow band is real. Tasks that are bounded, reversible, verifiable, and tool-scoped are where agents earn their keep. Good memory architecture keeps agents coherent across sessions. Good observability tells you why the agent did what it did. Least-privilege tool scopes limit blast radius when things go wrong. Human escalation handles the edges that no system should handle autonomously. These are the foundations. Everything else is detail.

Start there. Build the evaluation infrastructure before you extend the scope. Do not make promises about timelines based on demos. Keep senior people in the loop on anything that matters. And be honest, with your team, your stakeholders, and yourself, about where you actually are in the journey.

That honesty is not pessimism. It is the prerequisite for building something real.

Frequently asked questions

What can AI agents actually do reliably today?

They earn their keep on tasks that are bounded, reversible, verifiable, and tool-scoped: data extraction from unstructured documents, first-pass triage against a known taxonomy, draft generation a human reviews before sending, and advisory monitoring that surfaces anomalies. Remove any one of those four properties and you have left the narrow band.

What can agents not do well yet?

Long-horizon research that demands a coherent goal across many hours, planning where the goal itself is ambiguous, negotiating with customers or vendors where relational stakes are high, and anything involving irreversible financial or legal actions. Those are not edge cases to engineer around; they are the boundary of the narrow band.

What does it take to deploy an agent safely in production?

Evaluation infrastructure before scope, explicit memory architecture, observability that answers "why did the agent do that?" in minutes rather than hours, least-privilege tool scopes, and human escalation designed as a first-class feature rather than a failure mode. If you want help building that foundation, that is the work my team does at Devlyn.
