How to Evaluate an AI Agent (Evals for Agents)

AI agent evals score the whole trajectory (tool calls, step efficiency, recovery, and goal state), not just the final answer. They are the harness that gates a deploy.

Evaluating an AI agent means scoring the whole trajectory, not just the final answer. You measure tool-call correctness, step efficiency, recovery from a failed step, and outcome success against a frozen task set. Single-output evals ask "was the answer right?" Agent evals ask "did the system take a sane path to a verifiable end state, and would it do that again?"

This distinction is not academic. An agent can produce a correct-looking final message while having called the wrong tool, mutated the wrong record, and burned forty steps to get there. Score only the last message and you certify a system that is one edge case away from an incident. That is why "a human reviews it" does not scale to multi-step agents: a person can spot-check a paragraph, but no one is reading a 40-step trace on every run at production volume.

A single-output eval grades the destination. An agent eval grades the journey, the destination, and the cost of getting there.

Key takeaways

Before the detail, the claims this piece will defend:

Trajectory beats output. Score tool-call correctness, step count, and recovery, not only the final response.
Outcome and process are different metrics. Outcome asks "did the database reach the right state?" Process asks "was the path sane?" You need both.
Freeze the task set. An AI agent evals harness needs a fixed, versioned set of tasks with known goal states, sampled from real traffic.
Consistency is its own metric. Run each task many times. On public benchmarks, agents that pass once often fail when you demand they pass eight runs in a row.
Human-gate the tail. Automate the bulk of scoring; route the low-confidence and high-blast-radius slice to a senior reviewer.

Why agent eval differs from single-output eval

A single LLM call has one degree of freedom: the text it returns. You score that text against a reference or a rubric and you are done. An agent has many degrees of freedom. It chooses a tool, reads the result, updates its plan, chooses again, and loops until it decides it is finished. Each of those choices is a place the run can go wrong while still ending on a plausible sentence.

I covered the production failure modes in my honest accounting of what agents can do today: agents confidently pursue the wrong goal, hallucinate a tool output and proceed as if it were real, or take an irreversible action and then report success anyway. None of those failures show up if your eval reads only the final message. They show up in the trajectory.

So LLM agent evaluation measures things a chat eval never has to. Did the agent pick the right tool for the step? Did it pass valid arguments? When a call failed, did it retry sensibly or spiral? How many steps did it take versus the known-good path? Did it stop, or did it loop until a timeout? The final answer is one signal among six or seven.

Trajectory and step metrics: what to actually measure

A trajectory is the ordered log of everything the agent did: each tool call, its arguments, the observation it got back, and the reasoning step that chose it. Evaluating an agent means scoring that log. The metrics I track on every run:

Tool-selection accuracy. Of the steps that called a tool, what fraction called the right one? This requires the full tool specification in the trace, not just the tool name. You cannot judge "right tool" without knowing what the alternatives were.
Tool-argument validity. Right tool, wrong arguments is a distinct failure. Score it separately or you will conflate two bugs that have two different fixes.
Step efficiency. Actual steps divided by the steps on the known-good path. A 1.0 is optimal. A 3.0 means the agent wandered, which costs tokens, latency, and money even when the outcome is correct.
Recovery rate. Of the runs where a tool call failed or returned an error, what fraction recovered and still reached the goal? This is the metric that separates a brittle demo from a system you can ship.
Loop and timeout rate. The fraction of runs that never terminated on their own. A high number here is a hard blocker regardless of accuracy.
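The five metrics above can be computed from a labeled trace with a few lines of code. The sketch below is illustrative only: the `Step` and `Run` record shapes, their field names, and the choice to aggregate step efficiency as total steps over total known-good steps are all assumptions, not a specific tool's schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Step:
    tool: str             # tool the agent actually called
    expected_tool: str    # tool a labeler marked as correct for this step (assumed label)
    args_valid: bool      # did the arguments pass schema and semantic checks
    errored: bool = False # did the call fail or return an error


@dataclass
class Run:
    steps: list[Step]
    known_good_steps: int  # length of the reference path for this task
    reached_goal: bool
    terminated: bool       # False when the run hit a loop or timeout


def score_runs(runs: list[Run]) -> dict[str, Optional[float]]:
    """Aggregate trajectory metrics over a batch of runs."""
    steps = [s for r in runs for s in r.steps]
    right_tool = [s for s in steps if s.tool == s.expected_tool]
    failed_runs = [r for r in runs if any(s.errored for s in r.steps)]

    def ratio(num: float, den: float) -> Optional[float]:
        return num / den if den else None  # None when the metric is undefined

    return {
        "tool_select_acc": ratio(len(right_tool), len(steps)),
        # Argument validity is scored only on right-tool steps, so the
        # two failure types stay separate.
        "tool_arg_valid": ratio(sum(s.args_valid for s in right_tool), len(right_tool)),
        "step_efficiency": ratio(
            sum(len(r.steps) for r in runs),
            sum(r.known_good_steps for r in runs),
        ),
        "recovery_rate": ratio(sum(r.reached_goal for r in failed_runs), len(failed_runs)),
        "loop_timeout_rate": ratio(sum(not r.terminated for r in runs), len(runs)),
    }
```

Scoring argument validity only where the tool was right is a design choice: it keeps a wrong-tool step from also counting as a wrong-argument step, which is the conflation the list warns about.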

Industry tooling has converged on this trajectory-first view in 2026. Vendors now ship pre-built templates for tool-call accuracy, trajectory convergence, and step-level analysis rather than only final-answer scoring, per Confident AI's agent evaluation guide. The shift is the whole story: from grading the answer to grading the path.

If you are wiring these metrics into a real harness rather than reading about them, that is the work we do at Devlyn: hire AI engineers who instrument the trajectory before the agent ships, not after an incident.

Outcome vs process: you need both numbers

There are two honest questions to ask of an agent run, and they are not the same question.

Outcome eval asks: did the world end up in the right state? For a retail-support agent that means the order was actually canceled, the refund issued, the database row correct. The cleanest outcome check compares the final system state against an annotated goal state. This is exactly how the τ-bench benchmark scores agents: it inspects the database at the end of a conversation and compares it to the goal, rather than trusting the agent's closing message (Yao et al., 2024). State, not narration.

Process eval asks: was the path sane? An agent can reach the right outcome through a reckless path that happened to work this time. It can also fail the outcome through a perfectly reasonable path that hit a genuine ambiguity. If you only measure outcome, you reward luck and punish honest uncertainty. If you only measure process, you certify agents that look tidy and ship the wrong result.

The trap is real even at full tool-call accuracy. A customer-service agent can hit 100% tool-call correctness and still violate policy on an edge case, because correctness in that domain is contextual and defined by people, not by a metric (Confident AI). Process metrics caught the path; only an outcome-and-policy check catches the violation. Report both, weighted for your blast radius.

Outcome without process rewards luck. Process without outcome rewards tidiness. Ship neither alone.
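One way to keep the two numbers honest is to compute them independently and record where they disagree. The following is a minimal sketch under stated assumptions: state is modeled as a flat dictionary, "sane path" is reduced to a step budget plus a forbidden-tool list, and the labels `lucky` and `sane_miss` are hypothetical names for the two disagreement cases.

```python
def outcome_pass(final_state: dict, goal_state: dict) -> bool:
    """True when every field named in the goal holds the expected value."""
    return all(
        key in final_state and final_state[key] == value
        for key, value in goal_state.items()
    )


def process_pass(called_tools: list[str], forbidden_tools: set[str], max_steps: int) -> bool:
    """True when the path stayed within budget and touched no forbidden tool."""
    within_budget = len(called_tools) <= max_steps
    touched_forbidden = bool(forbidden_tools.intersection(called_tools))
    return within_budget and not touched_forbidden


def classify(outcome_ok: bool, process_ok: bool) -> str:
    """Name the four quadrants so disagreements are easy to count and route."""
    if outcome_ok and process_ok:
        return "pass"
    if outcome_ok:
        return "lucky"      # right end state, reckless path
    if process_ok:
        return "sane_miss"  # reasonable path, wrong end state
    return "fail"
```

A real process check would be richer than a step budget, but even this crude split makes the "lucky" runs visible instead of letting them count as clean passes.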

Building an agent task set

The harness is only as honest as the task set behind it. The discipline mirrors what I argued in evals that predict production: sample from real traffic, freeze it, version it, and never let it grow organically. An agent task set adds one demand a chat eval does not have: each task needs a defined goal state, not just a reference answer.

A single task in the set has four parts:

Initial state. The starting fixture: the seeded database, the available tools, the policy the agent must follow.
The instruction. What the user asks, in real-traffic phrasing, including the messy and compound requests that break agents.
The goal state. The verifiable end condition. Which records should exist, which fields should hold which values, which actions are forbidden.
The known-good path. An optional reference trajectory, used to compute step efficiency and to make trajectory disagreement legible.
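As a concrete illustration, the four parts above map onto a small record that serializes to one JSON line per task. The class name, field names, and example values here are hypothetical; the only thing taken from the text is the four-part shape.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AgentTask:
    task_id: str
    initial_state: dict            # seeded fixture: records, available tools, policy
    instruction: str               # the user request, in real-traffic phrasing
    goal_state: dict               # fields that must hold these values at the end
    forbidden_actions: tuple = ()  # actions that fail the task if taken
    known_good_path: tuple = ()    # optional reference trajectory

    def to_jsonl(self) -> str:
        """Serialize to a single line, with stable key order for diffing."""
        return json.dumps(asdict(self), sort_keys=True)


task = AgentTask(
    task_id="cancel-and-refund-001",
    initial_state={"orders": {"1042": {"status": "shipped"}}, "tools": ["lookup_order", "cancel_order"]},
    instruction="cancel my order and also where is my refund??",
    goal_state={"orders.1042.status": "canceled"},
    forbidden_actions=("delete_customer",),
    known_good_path=("lookup_order", "cancel_order"),
)
line = task.to_jsonl()
```

Keeping the goal state and forbidden actions in the task record, rather than in the grader, means the task file alone defines what "correct" meant for that version of the set.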

Build the set by stratifying real sessions: over-weight the runs where the agent's confidence was low, where a human corrected it, where a tool call failed, and where a past incident occurred. A uniform sample under-represents every hard case, and the hard cases are the ones that cost you money in production.
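A quota-based sampler is one simple way to over-weight the hard strata. This sketch assumes each session has already been tagged with a single `stratum` label (in practice a session may fall into several), and the stratum names and quotas are made up for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(sessions: list[dict], quotas: dict[str, int], seed: int = 0) -> list[dict]:
    """Draw up to `quota` sessions from each stratum, reproducibly."""
    rng = random.Random(seed)  # fixed seed so the same draw can be repeated
    by_stratum = defaultdict(list)
    for session in sessions:
        by_stratum[session["stratum"]].append(session)

    picked = []
    for stratum, quota in quotas.items():
        pool = by_stratum.get(stratum, [])
        # Rare strata may hold fewer sessions than the quota; take them all.
        picked.extend(rng.sample(pool, min(quota, len(pool))))
    return picked


sessions = (
    [{"id": f"r{i}", "stratum": "routine"} for i in range(100)]
    + [{"id": f"f{i}", "stratum": "tool_failure"} for i in range(5)]
    + [{"id": f"h{i}", "stratum": "human_corrected"} for i in range(3)]
)
quotas = {"routine": 10, "tool_failure": 5, "human_corrected": 5}
sample = stratified_sample(sessions, quotas)
```

With these made-up numbers, routine sessions are about 93% of traffic but only 10 of the 18 sampled tasks, which is the over-weighting the paragraph above asks for.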

Then freeze it as a named artifact. The same rule applies as for any eval: once the set is frozen, a score moves only because the agent changed, never because the tasks did, which is the point. You want a fixed ruler, not a rubber band. Cut a new version when you add tasks; never edit the old one.
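One lightweight way to enforce the freeze is to record a content hash when the version is cut and refuse to run if the file no longer matches. This is a sketch of that idea, not a prescribed mechanism; the function names are hypothetical.

```python
import hashlib


def fingerprint(jsonl_text: str) -> str:
    """SHA-256 of the task file's exact contents."""
    return hashlib.sha256(jsonl_text.encode("utf-8")).hexdigest()


def verify_frozen(jsonl_text: str, recorded: str) -> None:
    """Raise if the task set differs from the version that was frozen."""
    actual = fingerprint(jsonl_text)
    if actual != recorded:
        raise ValueError(
            f"task set changed: expected {recorded[:12]}, got {actual[:12]}"
        )


v1 = '{"task_id": "t1"}\n'
v1_digest = fingerprint(v1)   # store this next to the version name
verify_frozen(v1, v1_digest)  # passes silently while the file is untouched
```

Adding a task changes the digest, so the only way forward is to cut a new named version with its own recorded hash.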

Consistency: run each task more than once

Agents are stochastic, so a single pass tells you almost nothing about reliability. The metric that matters is whether the agent passes the same task on repeated, independent runs. τ-bench formalized this as pass^k: the probability that an agent succeeds on all of k independent attempts at a task.

The published numbers are sobering and worth internalizing before you trust a demo. On τ-bench, even strong function-calling agents succeed on under half the tasks on a single attempt, and consistency collapses under repetition: pass^8 falls below 25% in the retail domain (Yao et al., 2024). An agent that looks like a coin flip per run looks like a long shot once you demand it behave eight times running. Your customers experience the repeated-run distribution, not the lucky single demo.

So report pass^k for a k that reflects your real volume, not pass^1. If an agent handles a thousand cases a day, its single-run accuracy is the wrong headline. The honest trade-off: measuring consistency multiplies your eval cost by k. Run it anyway on the frozen set. Cheap evals that hide variance are how teams ship coin flips.
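To make pass^k concrete: if each task is run n times and succeeds c times, one standard combinatorial estimate of "all k attempts succeed" is the fraction of size-k subsets of those n runs that contain only successes, C(c, k) / C(n, k), averaged over tasks. The sketch below assumes that estimator; the function names are hypothetical.

```python
from math import comb


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate P(all k attempts succeed) for one task from `trials` runs."""
    if not 0 < k <= trials:
        raise ValueError("need 0 < k <= trials")
    # comb(successes, k) is 0 when successes < k, so one failure in
    # eight runs gives pass^8 = 0 for that task.
    return comb(successes, k) / comb(trials, k)


def suite_pass_hat_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-task estimate over (successes, trials) pairs."""
    return sum(pass_hat_k(c, n, k) for c, n in results) / len(results)


# A true coin flip per run, demanded eight times in a row:
coin_flip_eight_times = 0.5 ** 8  # 0.00390625, roughly 0.4%
```

The arithmetic is the argument: a per-run success rate that sounds tolerable shrinks geometrically once every one of k runs has to pass.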

A realistic agent eval log

Here is an abbreviated run of a trajectory-aware harness against a frozen agent task set. The numbers are realistic but illustrative, not from a specific live system.

# agent eval run against frozen task set agent-set-2026-w24-v1.jsonl
python -m agent_eval.runner \
  --suite agent-set-2026-w24-v1.jsonl \
  --agent support-agent-2026-06-15 \
  --passk 8

# results summary (n=160 tasks)
outcome pass@1      0.78           # goal-state match, single run
outcome pass^8      0.41           # all 8 runs succeed, the number that ships
tool select acc     0.93
tool arg valid      0.88           # right tool, wrong args is the gap
step efficiency     1.62           # 62% over known-good path, flag
recovery rate       0.71           # recovered from failed calls
loop/timeout        3.1%           # threshold 2.0%, FAIL
policy violations   2              # human-gated tail, both escalated
verdict             GATE BLOCKED   # loop rate + 2 policy hits

Read that log and the design becomes obvious. Outcome pass@1 looks shippable at 0.78. The pass^8 of 0.41 tells the truth: under repeated runs this agent fails the task more often than not. The step-efficiency flag says it works too hard even when it succeeds, which is a latency and cost problem before it is a quality one. And the gate is blocked not on the headline accuracy but on the loop rate and two policy violations the tail review caught. That is the harness doing its job.
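The verdict line in a log like that comes down to a handful of hard thresholds checked independently of headline accuracy. A minimal sketch, assuming a 2.0% loop-rate limit as in the illustrative log and a zero-tolerance policy limit (that second limit, the dictionary keys, and the `GATE OPEN` label are assumptions):

```python
def gate(summary: dict, limits: dict) -> tuple[str, list[str]]:
    """Return a verdict plus the list of blocking reasons."""
    reasons = []
    if summary["loop_timeout_rate"] > limits["max_loop_timeout_rate"]:
        reasons.append("loop rate")
    if summary["policy_violations"] > limits["max_policy_violations"]:
        reasons.append(f"{summary['policy_violations']} policy hits")
    verdict = "GATE BLOCKED" if reasons else "GATE OPEN"
    return verdict, reasons


limits = {"max_loop_timeout_rate": 0.02, "max_policy_violations": 0}
verdict, reasons = gate({"loop_timeout_rate": 0.031, "policy_violations": 2}, limits)
```

Returning the reasons alongside the verdict matters more than the verdict itself: a blocked deploy with no stated cause just gets overridden.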

Human-gate the tail, not the whole queue

You cannot put a human on every agent run. You also cannot put a human on none of them. The pattern that scales is to gate the tail: automate scoring for the bulk of runs and route a small, deliberately chosen slice to a senior reviewer.

The slice is not random. Send a human the runs where outcome and process disagree, where the agent's own confidence was low, where a policy-sensitive tool was touched, and where the action was hard to reverse. That is the same blast-radius logic behind the argument that human-in-the-loop is not a plan by itself: the reviewer is a designed component with a defined trigger and a response budget, not a rubber stamp you bolt on after launch. A human reviewing 3% of runs by deliberate selection beats a human nominally reviewing 100% and actually reading none.
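The four triggers translate directly into a routing function. In this sketch the `RunSummary` fields, the 0.7 confidence cutoff, and the tool names in `POLICY_SENSITIVE` are all hypothetical placeholders; the triggers themselves are the ones named above.

```python
from dataclasses import dataclass

# Hypothetical tool names; substitute the tools your policy actually covers.
POLICY_SENSITIVE = frozenset({"issue_refund", "delete_account"})


@dataclass
class RunSummary:
    outcome_ok: bool
    process_ok: bool
    confidence: float       # agent's self-reported confidence, 0 to 1 (assumed)
    tools_called: frozenset
    irreversible: bool      # did the run take an action that is hard to undo


def review_reasons(run: RunSummary, min_confidence: float = 0.7) -> list[str]:
    """Return the triggers a run hit; an empty list means auto-score it."""
    reasons = []
    if run.outcome_ok != run.process_ok:
        reasons.append("outcome_process_disagree")
    if run.confidence < min_confidence:
        reasons.append("low_confidence")
    if POLICY_SENSITIVE & run.tools_called:
        reasons.append("policy_sensitive_tool")
    if run.irreversible:
        reasons.append("hard_to_reverse")
    return reasons
```

Counting how many runs return a non-empty list gives the size of the human-gated slice directly, and the per-reason breakdown shows which trigger is driving reviewer load.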

This is where engineering meets revenue. Every run you can safely auto-score is margin; every run you must route to a senior reviewer is cost. A good agent eval harness is, among other things, the instrument that tells you exactly how large that human-gated slice has to be, which is the number that decides whether the agent is profitable to run at all.

Where this connects to the rest of the stack

Agent evals are one specialization of a broader discipline. The frozen-set, blinded-rater, human-calibrated mechanics carry straight over from single-output LLM evaluation; agents just add the trajectory layer on top. And the agents worth evaluating are usually built as agentic workflows with bounded, tool-scoped steps, which is precisely what makes their trajectories legible enough to score in the first place. An agent you cannot trace is an agent you cannot evaluate.

FAQ

What are AI agent evals?

AI agent evals are tests that score an agent's full trajectory rather than only its final answer. They measure tool-call correctness, argument validity, step efficiency, recovery from failed steps, and whether the run reached a verified goal state. They run against a frozen, versioned task set so scores are comparable over time.

How is agent evaluation different from LLM evaluation?

LLM evaluation scores one output: the text a model returns. Agent evaluation scores a sequence of decisions: which tools the agent called, in what order, with what arguments, and whether it recovered when a step failed. Agents have many ways to fail mid-run while still ending on a plausible message, so you score the path and the end state, not just the message.

What metrics matter most in an agent eval framework?

The load-bearing metrics are outcome success against a known goal state, tool-selection and argument accuracy, step efficiency versus a known-good path, recovery rate, and loop or timeout rate. For reliability, report pass^k across repeated runs, not single-run accuracy, because production sees the repeated-run distribution.

How often should I run agent evals?

Run the frozen task set on every model swap, every prompt change, and every new tool you grant the agent. That is the same regression discipline you would apply to evals that predict production. Then sample live traffic continuously and route the low-confidence tail to review. Offline regression catches what you changed; online sampling catches the distribution shift you did not.

Can I just have a human review agent outputs instead?

Not at production volume. A human can spot-check a single answer, but no one reads a 40-step trace on every run when an agent handles thousands of cases a day. The pattern that works is human-gate-the-tail: automate scoring for most runs and route the low-confidence, high-blast-radius slice to a senior reviewer.

Build the agent harness before you widen its scope

If you are putting an agent into production, build the AI agent evals harness before you extend what the agent is allowed to do. Sample real tasks, define their goal states, freeze the set, score the trajectory, and report consistency under repeated runs. My book Agents That Actually Work covers the containment and tool-scoping side, and A Field Guide to Evals covers the harness mechanics in depth.

If you would rather have a team that builds agents with evaluation wired in from day one, that is the work we do: hire AI engineers who ship agents you can actually trust in production, with the trajectory evals to prove it.