How to Build an AI Agent (the Loop That Holds)

How to build AI agents that hold: spec the task, give it bounded tools, add guardrails in code, wire evals, and ship behind a human gate.

To build an AI agent, build a loop. Here is the recipe in five moves: spec the task and its success criteria, give the model a small set of bounded tools, enforce guardrails in code, wire evals against real traces, then ship it behind a human gate. That is the whole thing. Everything else is a framework choice, and the framework is the easy part.

I am going to walk through how to build AI agents the way I actually ship them, from both seats. As an engineer I have traced agent runs at 3am to find out why a "working" system spent forty dollars looping on a malformed tool call. As an operator I have signed the budget for those systems and answered for them when they sent something a customer was not supposed to see. The reflexes below are what separate a demo that wows a stakeholder from a system that holds. This is the build-level companion to my field guide to agentic workflows, which makes the case for when an agent is the right tool at all.

An agent is a loop: plan, call a tool, read the result, decide whether to stop. The work is making that loop safe to leave running.

The reason to take the loop seriously is arithmetic. If each step succeeds with probability p, an n-step task succeeds end to end with probability p^n. At 95% per-step accuracy a 20-step task completes only about a third of the time, and the compounding math is unforgiving: five steps land around 77%, twenty steps around 35%, fifty steps around 7%. Most of this guide is about fighting that exponent.

Key takeaways for building AI agents

If you read nothing else, read these.

An agent is a loop, not a framework. Plan, call a tool, read the result, decide whether to stop. You can write it by hand in under a hundred lines, and doing so once shows you what every framework is doing for you.
The spec is the program. If you cannot state the success criterion as something you could test, you are not ready to build. Input, output, allowed tools, stop condition.
Guardrails belong in code, not the prompt. A prompt is advisory and a model is persuadable. Step ceilings, spend budgets, and tool permissions live in the code that wraps the model call.
You score the trajectory, not just the answer. Build the eval harness from real traces and run it on every release, because an upstream model update can shift a behavior you depended on.
Reliability compounds downward. Bound the task so the step count stays small, gate the irreversible actions behind a human, and widen scope only when evals earn it.

If you are weighing whether a process should be an agent at all before you build it, the narrow-band framework in my book Agents That Actually Work is the faster read. The rest of this piece assumes you have decided to build.

Start with the spec, not the framework

Step one is not picking LangGraph or the OpenAI Agents SDK. Step one is writing down, in plain language, what the agent does and how you will know it worked. If you cannot state the success criterion as something you could test, you are not ready to build the agent.

A usable spec has four parts: the input the agent receives, the output it must produce, the tools it is allowed to touch, and the explicit stop condition. "Triage inbound support tickets and draft a response for each, escalating anything about billing" is a spec. "Handle support" is a wish. The first one tells you exactly what to evaluate. The second one guarantees you will argue about whether it worked.

This is the same move I make everywhere, because the spec is the source of truth for any AI-Native system, an idea I argue in my AI-Native thesis. The spec is what your evals score against later. Write it first and the rest of the build has a target. Skip it and you are tuning a system with no definition of done, the failure I keep coming back to in the principles of building AI agents.

Build the loop, see the trace

An agent is an LLM calling tools in a loop until it reaches a goal or hits a budget. Anthropic's definition is deliberately plain, and I keep it plain on purpose. Before you reach for a framework, write the loop once by hand so you can see every turn. Here is the shape of it, with the trace it should print.

# the loop, stripped to its bones

while step < MAX_STEPS and not done:

reply = model.respond(messages, tools=ALLOWED)

if reply.tool_call:

result = run_tool(reply.tool_call) # validated, idempotent

messages.append(result)

else:

done = True

step += 1

Now run it and watch the trace, because the trace is the product. A good agent log is legible at a glance and tells you what the model decided and why.

# one agent run, abbreviated trace

[1] plan "classify ticket, then draft reply"

[2] tool get_ticket(id=8842) -> "category: billing"

[3] guard billing -> ESCALATE (no auto-reply on billing)

[4] stop handed to human queue, step=3/12

The honest part: most of what you build is not the model call. It is the plumbing around line three of that trace. The validation, the budgets, the escalation rule, and the logging are the agent. The model.respond call is one line. Teams who think the model is the hard part ship demos. Teams who think the loop is the hard part ship systems.

Give it bounded tools, not a toolbox

Tools are how an agent acts, and the cheapest reliability win is to give it fewer of them. Selection quality drops as the option count climbs, so a tight, well-documented tool set beats a sprawling one every time. Each tool is an API you own: validate its inputs and outputs, make its side effects idempotent, and put a time and cost budget on it.

Bounding tools is also where engineering meets the P&L. An agent that drafts emails does not get a send key. An agent that reads customer data connects to a read replica, not the primary. The point is not theoretical tidiness. It is that the blast radius of a bad decision is exactly the set of tools you handed over. Scope the tools and you scope the worst case.

The protocol layer has settled enough to lean on. The Model Context Protocol gives you one way to expose tools that most agents can consume, so you implement an integration once instead of per framework. That is a real convenience. It does not change the rule: every tool the agent can reach is a decision you have delegated, and you should be able to name why. The deeper patterns for this live in the agentic design patterns that hold in production.

Put guardrails in code, not in the prompt

Here is the principle I will not bend on: enforce guardrails in code, not in the system prompt. A prompt is advisory, and a model is persuadable. A guardrail that lives in the prompt can be talked around. A guardrail that does not exist in the tool layer cannot.

Concretely, that means three brakes on the loop. A step ceiling that stops the agent instead of letting it wander. A spend budget that halts the run when token or tool cost crosses a line. And hard exit conditions, including a self-correction step where a failed tool result is revised mid-run rather than passed downstream to poison the next decision. None of these live in the prompt. They live in the code that wraps the model call.

A guardrail in the prompt can be talked around. A guardrail that does not exist in the tool layer cannot.

Failure mode when you skip this: the agent treats a malicious instruction buried in retrieved data as a real command, because an LLM has no built-in way to tell trusted instructions from untrusted content when both arrive as the same tokens. That is why prompt injection is now treated as an architectural constraint, not a bug you patch. The defense is structural. The agent can only do what its tools permit, and the dangerous tools are not in the set.

Wire evals from real traces, run them on every release

"A human reviews it" is not an evaluation plan. Build a harness that scores the agent on the failure modes that actually cost you, derive the cases from real traces, and run it on every release so a regression fails the build before a customer finds it.

Agent evals look different from single-prompt evals because you score the whole trajectory, not just the final answer. Did the agent call the right tools in a sensible order? Did it escalate when its confidence was low? Did it stop when it should have stopped? Anthropic's guidance on agent evals makes the same point: good evals make behavioral changes visible before they reach users, which matters because a model update upstream can shift a behavior you depended on without anyone announcing it.

The thing that should scare you into doing this is the gap between lab and production. An agent can clear a benchmark and still fail on real traffic, because the benchmark measures average-case behavior while production punishes the tail. The agent did not get worse; you simply never measured the cases that break it. The fix is an eval suite built from your own traffic, which is exactly the discipline behind evals for agents that predict production. For the framing under it, my honest accounting of what agents can do today and the book it became, Agents That Actually Work, both start from the same premise: evaluation is the job.

This is also the point where most teams discover they wired evals too late. If you would rather have the harness and the observability built in from day one than reconstructed after an incident, that is the work a Devlyn AI engineering team does, with the observability and monitoring sitting under the agent from the first deploy rather than bolted on after the first customer complaint.

Ship behind a human gate

The last step is the one that decides whether you sleep. Ship the agent behind a human gate at the decisions you will never delegate, and enforce the gate in the tool layer where it cannot be skipped.

Modern frameworks make this concrete. The OpenAI Agents SDK lets a tool declare that it needs approval, which pauses the run, surfaces the pending action, and resumes only after a human approves or rejects it. The agent triages; the human decides. That is the thesis in one mechanism: the machine does the work, and the human evaluates.

From the revenue seat this gate is an accountability boundary, not a nicety. When an agent sends a discount, edits a contract, or emails a customer, someone owns the consequence. If you cannot name who owns it, the agent should not be allowed to do it. The trade-off is real and worth saying plainly: a human gate makes the agent slower and less impressive in a demo. You give up the long autonomous run that wows a stakeholder. In exchange you get a system that does not put your name on a mistake you never approved. In production that trade is not close.

This is also why I am skeptical of starting with elaborate multi-agent swarms. A narrow, observable workflow that succeeds 99% of the time and escalates the rest beats a clever multi-agent system that collapses in week two. Bound it, gate it, then widen scope only when your evals earn it. The durable version of that argument is in the principles of building reliable AI agents.

Where this breaks

Being honest about the failure surface is part of the build. Three places break most often, in my experience.

Memory breaks first. One support agent I reviewed was told to escalate any ticket open more than 48 hours, and it quietly missed a third of them because the sent-message log it read lived in a second system with a timezone bug. Stale facts read as current facts, and the agent acted on them with full confidence. Second, tool sprawl: every tool you add lowers selection accuracy and widens the blast radius. A team I advised had an agent that handled five tasks at 96% reliability, then watched it slide toward the low 80s once they pushed it to fifteen, because the model now chose wrong among too many lookalike tools. Third, the silent regression, where an upstream model update changes a behavior and nothing flags it until a customer does. That last one cost one team a week of wrong refund quotes before anyone noticed, which is the entire argument for continuous evals over a one-time launch check.

None of these are exotic. They show up in the first week of real traffic, not the edge of the distribution. If your framework picks, that is fine; the honest comparison of agentic AI frameworks from production walks through what each one costs you here.

Frequently asked questions

How do you build an AI agent step by step?

Build it as a loop in five moves. First, write a spec with a testable success criterion: input, output, allowed tools, and stop condition. Second, build the plan-act-observe loop and make the trace legible. Third, give it a small set of bounded, validated tools. Fourth, enforce guardrails in code, including step ceilings, spend budgets, and exit conditions. Fifth, ship behind a human gate at the decisions you will never delegate, with evals running on every release. The framework you use to assemble these is the least important choice.

Do I need a framework like LangGraph or the OpenAI Agents SDK to build an agent?

No. An agent is an LLM calling tools in a loop, and you can write that loop by hand in well under a hundred lines. Doing it by hand once is the fastest way to understand what the framework is doing for you. Reach for a framework when you need its specific features, such as built-in human-in-the-loop approvals, handoffs between specialist agents, or streaming. Pick the abstraction the system demands, not the one with the best landing page.

Why do AI agents fail in production?

Mostly compounding error. If each step succeeds with probability p, an n-step task succeeds with probability p^n, so a 95% per-step agent completes a 20-step task only about a third of the time. A second cause is the gap between lab and production: an agent clears a benchmark that measures average-case behavior, then meets the tail cases the benchmark never tested. The fix is to bound the task so n stays small, then evaluate continuously on your own real traffic.

How do you keep an AI agent safe to run unattended?

Enforce limits in code, not in the prompt. Give the agent the smallest tool set that does the job, with no access to irreversible actions. Put a step ceiling and a spend budget on the loop. Gate any high-stakes action behind human approval at the tool layer, so it pauses and waits rather than acting. Then run evals built from real traces on every release so a behavior regression fails the build before a customer hits it.

If you are turning this loop into a system that has to hold under real load, that is the work my team does every day. See how a Devlyn AI engineering team builds AI agents with bounded tools, guardrails in code, and evals wired in from day one. The loop is free to write. Making it safe to leave running is the job.