Principles of Building AI Agents That Hold in Production

The principles of building AI agents do not live in any framework. They sit underneath all of them: bound the autonomy, name what you will never delegate, evaluate continuously, and design memory that is honest about what it knows. Learn those four and you can pick up LangGraph, CrewAI, or the OpenAI Agents SDK in an afternoon. Skip them and the framework will not save you.

I have watched this play out from both seats. As an engineer I have traced agent runs at 3am to find out why a "working" system did something nobody asked for. As an operator I have signed off on the budget for those systems and answered for them when they failed. The frameworks change every quarter. The reasons agents break in production have not changed at all.

Here is the number that should anchor every design decision. If each step in an agent's loop succeeds independently with probability p, an n-step task succeeds end to end with probability p^n. At 95% per-step accuracy, a 20-step task completes only about 35% of the time. The compounding math is unforgiving, and it helps explain why Gartner predicts more than 40% of agentic AI projects will be canceled by the end of 2027, citing escalating costs, unclear business value, and inadequate risk controls. The principles below are how you fight the exponent. This guide expands on the narrow-band argument in my field guide to AI agents and agentic workflows.
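The arithmetic is easy to check for yourself. Here is a minimal sketch in Python; the function name is mine, and it assumes each step succeeds independently, which is the same simplification the p^n formula makes.

```python
def end_to_end_success(p: float, n: int) -> float:
    """Probability that all n steps succeed, assuming each step
    succeeds independently with probability p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability between 0 and 1")
    if n < 0:
        raise ValueError("n must be non-negative")
    return p ** n


if __name__ == "__main__":
    # Compare short and long loops at the same per-step accuracy.
    for steps in (5, 10, 20):
        rate = end_to_end_success(0.95, steps)
        print(f"{steps:>2} steps at 95% per step: {rate:.1%}")
```

Run it with your own per-step numbers. At 95% per step, cutting a loop from 20 steps to 5 moves the end-to-end rate from the mid-30s to roughly 77%, which is the whole case for short loops.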

Key takeaways

If you read nothing else, read these.

Reliability compounds downward. A 95% per-step agent completes a 20-step task only about 35% of the time. Short loops beat long ones, every time.
Bound autonomy first. Give the agent the smallest scope that does the job, then widen it only when your evals earn the right.
Name what you never delegate, and enforce it in the tool layer. Prompts are advisory and models are persuadable; an action that does not exist in the tool set cannot be talked into existence.
"A human reviews it" is not an evaluation plan. Score the trajectory, not just the final answer, and run the harness on live traffic forever.
Memory must be honest about its own freshness. An agent that cannot say "I do not know" will confidently act on stale or invented facts.

Bound the autonomy before you extend it

The first principle: give an agent the smallest possible scope that still does the job, then widen it only when your evals earn the right. Autonomy is not a virtue. It is a liability you take on deliberately, in exchange for a specific payoff.

The reasoning is the compounding math again. Every additional step the agent decides on its own is another factor of p in p^n. A bounded task with a clear start and stop state keeps n small. "Triage these 200 tickets and draft a response queue" is bounded. "Run our support function" is not. Bounding the task is the single cheapest way to raise reliability, and most teams skip it because the demo looked impressive at full scope.

Concretely, bounding means three things: a defined input and output, a tool set scoped to least privilege, and a step ceiling that stops the loop instead of letting it wander. An agent that drafts emails does not get a send key. An agent that reads a database connects to a read replica, not the primary.
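Those three constraints fit in a few dozen lines. What follows is a sketch under stated assumptions, not any framework's API: BoundedAgent, StepCeilingReached, and the decide callback (a stand-in for the model call) are hypothetical names I am using for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical names throughout: this is not any framework's API.
Tool = Callable[[str], str]


class StepCeilingReached(Exception):
    """Raised when the loop hits its ceiling without reaching a stop state."""


@dataclass
class BoundedAgent:
    tools: dict[str, Tool]   # least privilege: only the tools the job needs
    max_steps: int = 5       # hard ceiling on the loop
    history: list[tuple[str, str]] = field(default_factory=list)

    def run(self, task: str,
            decide: Callable[[str, list], tuple[str, str]]) -> str:
        """decide() stands in for the model. It returns (tool_name, argument),
        or ("stop", final_output) when the task is done."""
        for _ in range(self.max_steps):
            name, arg = decide(task, self.history)
            if name == "stop":
                return arg  # defined output: one string
            if name not in self.tools:
                # The tool is not in scope for this agent, so the call fails closed.
                raise PermissionError(f"tool not in scope: {name}")
            self.history.append((name, self.tools[name](arg)))
        raise StepCeilingReached(f"no stop state after {self.max_steps} steps")
```

An email-drafting agent built this way gets a draft_email tool and nothing else. If the model asks for send_email, the request raises instead of sending, and a loop that never reaches a stop state ends at the ceiling instead of wandering.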

A founder I advised, Maya, wanted an agent to run onboarding for her SaaS end to end: read each signup's setup, configure the workspace, email the customer, and book a call. It demoed beautifully and broke in week two, when it misread a plan field, configured the wrong billing tier, and emailed a customer a quote that was off by a factor of three. We cut the scope to one bounded, reversible job: draft the onboarding email for a human to send. That version has run for four months without an incident. The whole task looked like an agent; the part that earned its keep was a sliver of it.

Failure mode when ignored: the agent runs longer than you expected, takes an action you cannot undo, and reports success anyway because it has no mechanism to surface its own error. By the time you notice, the blast radius is whatever you handed it. The OWASP Top 10 for Agentic Applications, released in late 2025, names this surface directly under goal hijack and memory poisoning. Bounded scope is what keeps the worst case small. If you want a team that scopes agents this way from day one, you can hire AI engineers who have shipped them in production.

Name what you will never delegate

The second principle is the one that ties directly to revenue, and it is the one vendors never write about. Decide, in advance and in writing, which decisions a human always makes. Then enforce it in the tool layer, not in the prompt.

This is the both-seats principle. From the engineering seat, a human checkpoint is a control surface. From the revenue seat, it is an accountability boundary. When an agent sends a discount, signs a contract clause, or emails a customer, someone owns the consequence. If you cannot name who owns it, the agent should not be allowed to do it. At Devlyn we run this as a standing rule: a senior person is always in the loop for anything that touches a customer relationship or a material financial outcome. The agent triages; the human decides.

The reason to enforce this in the tool layer is blunt: prompts are advisory and models are persuadable. Cisco's security team argues that prompt injection is the new SQL injection and that guardrails alone are not enough, because a model processes system instructions, user input, and retrieved context as one continuous stream of tokens, with no architectural way to tell trusted commands from untrusted data. A guardrail written into the system prompt can be talked around. An action that simply does not exist in the agent's tool set cannot be talked into existence.

Picture the difference in one team. A fintech group I worked with had an agent that could issue account credits up to a cap, gated only by a system-prompt rule that said "never exceed $50 without approval." A crafted support message walked it past the rule twice in a week. We moved the cap into the tool itself: the credit endpoint refused any amount over the threshold and returned a flag for human approval instead. The same prompt-injection attempts hit the new design and did nothing, because the dangerous action no longer existed in the agent's reach.
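The shape of that fix is simple enough to sketch. This is an illustration, not that team's code: issue_credit, CreditResult, and the status strings are hypothetical, and the $50 cap is taken from the rule quoted above.

```python
from dataclasses import dataclass
from decimal import Decimal

# Assumed for illustration: the cap value comes from the quoted prompt rule,
# everything else here is a hypothetical shape for the tool.
CREDIT_CAP = Decimal("50.00")


@dataclass(frozen=True)
class CreditResult:
    account_id: str
    amount: Decimal
    status: str   # "issued" or "needs_human_approval"
    reason: str


def issue_credit(account_id: str, amount: Decimal) -> CreditResult:
    """The only credit tool the agent can call. The cap is checked here,
    in code, so no prompt or injected message can raise it."""
    if amount <= 0:
        raise ValueError("credit amount must be positive")
    if amount > CREDIT_CAP:
        # Issue nothing; hand the decision to a person.
        return CreditResult(account_id, amount, "needs_human_approval",
                            f"amount exceeds cap of {CREDIT_CAP}")
    # A real implementation would write to the ledger here (omitted in this sketch).
    return CreditResult(account_id, amount, "issued", "within cap")
```

The model can be persuaded to ask for $500. It cannot be persuaded into receiving it, because the only path above the cap runs through a human.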

Failure mode when ignored: the agent does something irreversible and expensive, and your post-mortem discovers there was never a clear answer to "who was supposed to approve this?" That is not a model failure. That is a design failure, and it lands on the P&L. The narrow band where autonomy earns its keep is the subject of my honest accounting of what agents can do today and the book it grew into, Agents That Actually Work. The discipline of naming undelegated decisions is the whole argument of my book Human in the Loop Is Not a Plan.

Evaluate continuously, not at launch

The third principle: "a human reviews it" is not an evaluation plan. Build a harness that scores the agent on the failure modes that actually cost you, and run it before every change and on a sample of live traffic forever.

The evidence here is stark, and it comes from production rather than the lab. One March 2026 reliability report analyzed more than 4.4 million tests across 6,259 deployed agents in ten regions and found an aggregate success rate of 56.6%, far below the near-perfect numbers those same agents post on benchmarks. The agents did not get worse in the field. The evaluation was measuring the wrong thing. A lab benchmark rewards average-case behavior on clean inputs; production punishes the tail, where requests are ambiguous and APIs are flaky.

A useful agent eval looks different from a single-prompt eval. You score the whole trajectory, not just the final answer. Did the agent call the right tools in a sensible order? Did it escalate when its confidence was low? Did it stop when it should have stopped? You also track behavioral drift over time, because a model update upstream can change a behavior you depended on without anyone announcing it. This is where evaluation and the human checkpoint meet: a sampled human review is not the whole plan, but it is the labeled data that keeps the automated harness honest.

    # A graded agent run, scored on the trajectory, not just the answer
    RUN ticket-triage agent=v4 model=mid-tier steps=5
    step 1 tool=search_kb ok latency=410ms
    step 2 tool=classify ok conf=0.88
    step 3 tool=check_refund ok matched=true
    step 4 action=flag_for_human ok # correct: never auto-decide refunds
    step 5 action=stop ok
    EVAL trajectory="pass" cost=$0.009 p95_latency=2.1s human_escalation=correct

Notice what the eval grades: the path, the cost, the latency, and whether the agent escalated the one decision it was told never to make alone. A final answer that looks right with a broken trajectory is a regression waiting to happen.
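A trajectory check like that can start very small. The sketch below is one possible grader, assuming a run is recorded as an ordered list of named steps; Step, grade_trajectory, and the individual check names are hypothetical, and cost and latency checks are left out to keep it short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str   # tool or action name, e.g. "check_refund"
    ok: bool    # whether the step itself succeeded


def grade_trajectory(steps: list[Step], max_steps: int = 5,
                     sensitive: str = "check_refund",
                     escalation: str = "flag_for_human") -> dict[str, bool]:
    """Grade the path the agent took, not just where it ended up."""
    names = [s.name for s in steps]
    checks = {
        "all_steps_ok": all(s.ok for s in steps),
        "within_step_ceiling": len(steps) <= max_steps,
        "stopped_cleanly": bool(names) and names[-1] == "stop",
        # If the sensitive step ran, an escalation must follow it.
        "escalated_sensitive_decision": (
            sensitive not in names
            or escalation in names[names.index(sensitive) + 1:]
        ),
    }
    checks["pass"] = all(checks.values())
    return checks


if __name__ == "__main__":
    run = [Step("search_kb", True), Step("classify", True),
           Step("check_refund", True), Step("flag_for_human", True),
           Step("stop", True)]
    print(grade_trajectory(run))
```

Drop the flag_for_human step from that run and the grade fails even though every remaining step reports ok. That is the case a final-answer eval misses.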

Failure mode when ignored: the suite passes every check right up until launch, and a customer finds the regression for you. Anthropic's own guidance on building effective agents leans on transparency for this reason: show the agent's planning steps so a human can inspect them. You cannot evaluate reasoning you cannot see. If you would rather have observability and evals built in from the start than bolted on after an incident, that is the work Devlyn does on AI observability and monitoring.

Design memory that is honest about what it knows

The fourth principle is the most underrated: an agent's memory should be honest about its own confidence and freshness, or it will confidently act on stale or invented facts. Memory is where reliable-looking agents quietly go wrong.

Agent memory comes in three kinds. Working memory lives in the context window and is ephemeral. Episodic memory is the structured log of what happened in past runs. Semantic memory holds the facts and preferences that should persist. The honesty problem shows up at the boundaries. An agent told to escalate issues open more than 48 hours fails to escalate because it reads the timestamp inconsistently. An agent told not to double-message a customer cannot check reliably because the sent-message log lives in another system with a timezone bug. These are not exotic edge cases. They surface in the first week.
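The 48-hour case is worth making concrete, because the fix is mostly a refusal to guess. A minimal sketch, assuming timestamps arrive as ISO 8601 strings with an explicit UTC offset; needs_escalation and ESCALATE_AFTER are hypothetical names.

```python
from datetime import datetime, timedelta, timezone

ESCALATE_AFTER = timedelta(hours=48)


def needs_escalation(opened_at_iso: str, now: datetime) -> bool:
    """True if the issue has been open longer than the threshold.

    Expects an ISO 8601 string with an explicit offset such as
    "2026-03-08T09:00:00-05:00". Timestamps without a timezone are
    rejected instead of being silently treated as local time or UTC.
    """
    opened_at = datetime.fromisoformat(opened_at_iso)
    if opened_at.tzinfo is None or now.tzinfo is None:
        raise ValueError("timestamp has no timezone; refusing to guess")
    # Normalize both sides to UTC so the comparison is unambiguous.
    age = now.astimezone(timezone.utc) - opened_at.astimezone(timezone.utc)
    return age > ESCALATE_AFTER
```

A ticket opened at 09:00 with a -05:00 offset is three hours younger than one stamped 11:00 UTC the same day. An agent that reads both as the same clock gets the 48-hour rule wrong in one direction or the other.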

Honest memory means every retrieved fact carries provenance and freshness, and the agent is built to say "I do not know" instead of filling the gap. Anthropic treats context engineering in 2026 as a first-class design problem rather than a prompt afterthought, warning that context windows of every size are subject to pollution and relevance decay. The deeper architecture for it is the subject of my guide to memory systems for agents, and the retrieval side, where memory meets RAG, is where Devlyn's work on RAG and knowledge integration earns its keep.
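In code, that honesty comes down to two fields and one refusal. The sketch below is an assumption-laden illustration, not a reference design: Fact, HonestMemory, and the max_age policy are hypothetical, and a real system would set freshness windows per fact type.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class Fact:
    value: str
    source: str             # provenance: where the fact came from
    observed_at: datetime   # freshness: when it was last confirmed (timezone-aware)


class HonestMemory:
    def __init__(self, max_age: timedelta):
        self.max_age = max_age
        self._facts: dict[str, Fact] = {}

    def remember(self, key: str, fact: Fact) -> None:
        if fact.observed_at.tzinfo is None:
            raise ValueError("observed_at must be timezone-aware")
        self._facts[key] = fact

    def recall(self, key: str, now: datetime) -> Optional[Fact]:
        """Return the fact only if it exists and is still fresh. Otherwise
        return None, which the caller must treat as "I do not know"."""
        fact = self._facts.get(key)
        if fact is None or now - fact.observed_at > self.max_age:
            return None
        return fact


if __name__ == "__main__":
    memory = HonestMemory(max_age=timedelta(days=7))
    seen = datetime(2026, 3, 1, tzinfo=timezone.utc)
    memory.remember("plan_tier", Fact("pro", "billing_api", seen))
    print(memory.recall("plan_tier", seen + timedelta(days=1)))   # fresh
    print(memory.recall("plan_tier", seen + timedelta(days=10)))  # stale
```

The design choice that matters is the None. A stale record is handled exactly like a missing one, so the agent has to re-fetch or escalate instead of acting on last week's answer.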

Failure mode when ignored: the agent retrieves a stale record, treats it as current, and takes an action that was correct last week and wrong today. Because the memory had no concept of its own staleness, nothing flagged it. The agent looked reliable right up until the moment it was not.

A note on frameworks

None of these principles tell you which framework to use, and that is the point. Anthropic's advice to keep agents simple and prefer composable patterns over heavy abstraction holds because the framework is the easy part. The hard part is the four constraints above, and they are framework-agnostic. Start with the smallest thing that works, add abstraction only when the system demands it, and let your evals decide when to widen scope.

There is a real trade-off here worth naming. Bounding autonomy and enforcing human checkpoints makes agents slower and less impressive in a demo. You give up the long, autonomous run that wows a stakeholder. What you get back is a system that holds at 3am and does not put your name on a mistake you never approved. In production, that trade is not close.

Frequently asked questions

What are the core principles of building AI agents?

Four durable principles sit beneath every framework: bound the autonomy to the smallest scope that does the job, name in writing what a human will never delegate and enforce it in the tool layer, evaluate continuously on real failure modes rather than only at launch, and design memory that is honest about its own freshness and confidence. Frameworks come and go; these constraints decide whether the agent holds in production.

Why do AI agents fail in production?

Mostly because of compounding error. If each step succeeds with probability p, an n-step task succeeds with probability p^n, so a 95% per-step agent completes a 20-step task only about 35% of the time. Production reliability data backs this up: one large 2026 study of deployed agents measured an aggregate success rate near 57%, well below their benchmark scores, because evals measured average-case behavior while production punishes the tail.

How much autonomy should an AI agent have?

As little as the job requires. Autonomy is a liability you take on deliberately in exchange for a specific payoff, not a default. Start bounded, with a defined input and output, least-privilege tools, and a step ceiling. Widen scope only when your evaluation harness shows the agent is reliable at the current scope.

How do you stop prompt injection from breaking an AI agent?

You cannot fully patch it at the prompt layer, because a model reads trusted instructions and untrusted data as the same stream of tokens. The durable defense is architectural: do not give the agent a tool that can take a dangerous action in the first place. Cap risky operations inside the tool itself, require human approval for anything irreversible, and treat every guardrail you can only express in the system prompt as advisory rather than enforced.

What is the best framework for building AI agents?

The wrong question to lead with. Anthropic and most production teams now recommend starting with simple, composable patterns and adding framework abstraction only when the system demands it. The framework is the easy part; bounding autonomy, enforcing human checkpoints, evaluating continuously, and designing honest memory are what actually determine reliability.

If you are turning these principles into a system that has to hold under real load, that is the work my team does every day. See how a Devlyn AI engineering team ships agents with bounded scope, named checkpoints, and evals built in from day one. The principles are free. Putting them into production reliably is the job.