AN Alpesh Nakrani
BlogBooksPraiseAbout Work with me →
Book overview
Chapter 7 / The AI-Native Canon

Autonomy Boundaries and Evaluation Strength

The more autonomy a system has, the stronger its evaluation must be.

Research spine: this chapter stays grounded in NIST AI Risk Management Framework and NIST Secure Software Development Framework, then applies that evidence to the operating judgment in the book. The more autonomy a system has, the stronger its evaluation must be. This sounds obvious until a team ships an agent that can change state, call tools, message customers, or create tickets using the same evaluation discipline they used for a text-generation demo.

Autonomy changes the risk surface. A summarizer can be wrong. An agent can be wrong and act. A recommender can show a bad item. A workflow agent can trigger side effects across systems. A coding assistant can generate a vulnerable change. A finance assistant can produce a number that someone relies on. The question is not whether humans are in the loop. The question is what autonomy is allowed before, during, and after evaluation.

Anthropic's practical writing on building effective agents emphasizes that many successful systems are structured workflows rather than unconstrained autonomous loops, and that agents should be used where the task warrants the complexity. Source: https://www.anthropic.com/engineering/building-effective-agents This fits the argument here: autonomy must earn its keep, and evaluation must match the autonomy granted.

Key Takeaways

  • The more autonomy a system has, the stronger its evaluation must be.
  • The practical test is whether a team can name the evidence, owner, and failure mode before it changes behavior.
  • Read this with Human in the Loop Is Not a Plan and the adjacent chapters when you need the wider Evals and Evaluation frame.

The Autonomy Ladder

A useful ladder:

LevelSystem behaviorEvaluation requirement
AssistDrafts or suggests; human actsOutput quality evals and reviewer rubric
RecommendRanks options; human selectsRanking evals, slice analysis, decision audit
PrepareCompletes work but does not executeEnd-to-end scenario evals and approval loop
Execute reversibleActs but can undoPre-release evals, monitoring, rollback tests, sampling
Execute irreversibleActs with hard-to-reverse consequencesStrong pre-action review, audit, red-team, policy signoff
Self-adjustChanges its own instructions/toolsChange-control evals, governance, strict boundaries

Most teams should live lower on the ladder than their demos suggest. A demo that shows an agent handling a whole workflow may be exciting, but the production version may sensibly prepare the workflow and ask a human to execute. That is not failure. It is matching autonomy to evidence.

Evaluation Strength

Evaluation strength has several dimensions:

  1. Coverage. Does the eval suite include common cases, edge cases, high-risk cases, and adversarial cases?
  2. Realism. Do cases resemble production inputs and tool states?
  3. Outcome validity. Are expected answers or actions defensible?
  4. Slice visibility. Can failures be seen by segment, language, workflow, or risk tier?
  5. Regression power. Would the suite catch known historical failures?
  6. Operational linkage. Are eval results tied to release gates?
  7. Human calibration. Are labels and rubrics stable enough to trust?
  8. Side-effect simulation. For agents, are tool calls and state changes tested safely?

A stronger eval is not simply bigger. A thousand easy cases do not protect a high-risk workflow. The eval must match the failure modes.

HELM's benchmark philosophy is useful because it pushes toward broad, transparent evaluation across scenarios and metrics rather than single-score claims. Source: https://crfm.stanford.edu/helm/latest/ RAGAS is useful for retrieval and generation pipelines because it separates dimensions such as faithfulness and context relevance. Source: https://arxiv.org/abs/2309.15217 The principle applies beyond those tools: evaluate the pieces that can fail, not only the final output.

Permissioned Autonomy

Autonomy should be permissioned. The system should have a policy that says which actions it may take, under what conditions, with which tools, on whose behalf, and with what review. This policy should be machine-readable enough to enforce.

agent_autonomy_policy:
 workflow: enterprise_support_agent

 allowed_without_review:
 - create_internal_summary
 - draft_customer_reply
 - tag_ticket

 allowed_with_review:
 - send_customer_reply
 - offer_service_credit_below_100
 - close_ticket_as_resolved

 prohibited:
 - change_contract_terms
 - issue_refund_above_500
 - disclose_cross_tenant_information

 required_evidence:
 send_customer_reply:
 - cited_policy_document
 - customer_issue_summary
 - confidence_above_0_85
 offer_service_credit_below_100:
 - account_standing
 - credit_policy
 - manager_override_if_repeat_customer

This is where human loops and evals connect. The autonomy policy determines which actions require review. The eval suite tests whether the system follows the policy. Monitoring checks whether production behavior matches it. Audit logs preserve evidence.

Tool Use and Side Effects

Tool use is the boundary where language output becomes operational action. A model drafting "I will refund the customer" is one thing. A model calling issue_refund(customer_id, amount) is another. The evaluation system must inspect tool selection, parameters, ordering, retries, and side effects.

Common tool-use failures include calling the right tool with the wrong parameter, calling tools in the wrong order, retrying non-idempotent actions, failing to check permissions before mutation, using stale retrieved data, and continuing after an error that should stop the workflow. These failures may not appear in final answer evaluation. The final answer may say "done" while the tool trace reveals the system did something unsafe.

Therefore, agent evals need trace evaluation. Did the agent call only allowed tools? Did it check preconditions? Did it stop when evidence was missing? Did it expose side effects to the reviewer? Did it preserve rollback data? Did it produce an audit record?

The Human Role at Each Autonomy Level

The human role changes as autonomy increases. At assist level, the human does the work and may use AI to draft. At prepare level, the human approves. At reversible execution level, the human monitors and handles exceptions. At irreversible execution level, the human sets policy, approves high-risk actions, and audits. At self-adjusting level, the human governs the system's ability to change itself.

This is why "human in the loop" is too vague. The human may be operator, approver, supervisor, auditor, trainer, policy owner, appeal authority, or incident commander. Each role requires different information.

Release Gates for Autonomy

Autonomy should be released gradually:

  1. Offline evaluation only.
  2. Shadow mode: system proposes actions but does not act.
  3. Human-confirmed mode: every action requires approval.
  4. Tiered mode: low-risk actions auto-execute, high-risk actions require approval.
  5. Monitored autonomy: sampling and monitoring replace approval for selected cases.
  6. Expanded autonomy only after evidence.

Each promotion should require evidence. Shadow mode should compare proposed actions to human actions. Human-confirmed mode should measure approval rate, correction rate, and reason codes. Tiered mode should monitor auto-action outcomes and sampled errors. Expanded autonomy should require stability across time, not one good day.

Exit Criteria

An autonomy boundary without exit criteria is a one-way door. The system should automatically reduce autonomy when quality drops, queue backlog exceeds limits, source freshness fails, tool errors spike, reviewer disagreement rises, or incident conditions are met. Reduced autonomy is not a failure. It is a safety valve.

Autonomy ladder beside an evaluation strength ladder, with stronger oversight required at higher autonomy levels
Each step up the autonomy ladder requires stronger evidence, review, audit, and governance before the system earns more authority.

Chapter Takeaway

Autonomy must be earned with evidence. The higher the system climbs from suggestion to action, the stronger its evals, permissions, review loops, monitoring, and exits must become.

Shadow Autonomy

Shadow autonomy is one of the safest ways to learn. The system proposes actions exactly as it would in production, but those actions are not executed. The team compares proposed actions to human actions, reviewer decisions, policy outcomes, and downstream results. Shadow mode reveals what autonomy would have done without exposing users to its full consequences.

Shadow mode is especially useful for agents. It can record tool plans, intended parameters, stopping behavior, and escalation decisions. Reviewers can inspect traces and classify errors. Engineers can convert failures into trace evals. Product can estimate how much work would be automated and how many cases would still need approval.

The common mistake is making shadow mode too short. One day of shadow traffic may not include rare cases. A useful shadow period covers enough volume, slices, and time variation to see representative behavior. It should include peak periods, new data, known edge cases, and adversarial tests.

Autonomy Should Be Narrow

Many AI failures come from granting broad autonomy when narrow autonomy would have created most of the value. An agent that can fully resolve every support ticket is risky. An agent that can gather evidence, draft a reply, and prepare three safe actions may deliver large savings with less risk. A code assistant that can rewrite an entire service is risky. One that can propose a small patch with tests under a spec gate is more manageable.

Narrow autonomy is not timid. It is how durable systems ship. The workflow should grant autonomy at the narrowest point where the machine creates clear value and evaluation can prove quality. Expand only after evidence.

Evaluating Non-Action

Autonomy evaluation should include cases where the correct behavior is not to act. Models and agents are often rewarded implicitly for completing tasks. But in production, refusing, escalating, or asking for more information may be the safest and most valuable action.

Eval sets should therefore include missing evidence, conflicting policy, unsafe requests, permission failures, ambiguous user intent, stale sources, tool errors, and irreversible actions without authorization. The expected behavior in these cases is stop, ask, escalate, or block. If the system is evaluated only on successful completions, it will learn the wrong lesson: action is always preferred.

Expanding Autonomy as a Governance Decision

Moving from approval mode to monitored autonomy should require governance review. The review does not need to be slow, but it should be explicit. What evidence supports expansion? Which failures remain? What monitoring will catch regression? What exit criteria reduce autonomy again? Who owns the risk?

This prevents autonomy creep. Without governance, teams gradually route more cases to auto-action because the system "seems fine." Then an incident reveals that no one formally accepted the new boundary. Autonomy should be a versioned product capability, not an accident.

Autonomy Boundary Tests

Before increasing autonomy, run boundary tests. These are cases designed to test whether the system stops at the right edge. Examples: a customer asks for a refund above the allowed amount; a user asks for another tenant's data; a tool returns an error after partial success; the required source is stale; two policies conflict; a request is phrased as urgent but lacks evidence; an agent is asked to perform an irreversible action.

A system that succeeds on normal cases but fails boundary tests is not ready for more autonomy. Boundary tests are the eval equivalent of guardrails on a bridge. They matter most where falling is expensive.

Autonomy Rollback

Autonomy needs rollback just like code. If sampled errors rise, if queue backlog spikes, if a model upgrade changes behavior, or if a new incident occurs, the system should move down the autonomy ladder. Auto-action can become approval-required. Approval-required can become manual-only. A tool can be disabled. A workflow can be narrowed to read-only mode.

Rollback should not require debate during an incident. The triggers and mechanisms should be defined before launch. The team should rehearse them. A rollback that exists only in someone's head is not a rollback.

Autonomy and Customer Trust

Some autonomy boundaries are not only technical. They are relational. Customers may accept an AI assistant drafting an answer but object to it making a decision about their contract. They may accept automatic summarization but expect human accountability for denial, pricing, or termination. Trust depends on expectation. A technically safe action can still damage trust if the customer expected a human decision. Therefore, autonomy policy should include user expectation and disclosure, not only internal risk. The system should know where automation is acceptable, where it must be disclosed, and where a human relationship still matters.

Evidence Before Expansion

A team should require a written autonomy expansion packet. It should include current tier distribution, eval results, sampled failure rates, approval and correction rates, incidents, reviewer disagreement, capacity impact, monitoring readiness, and rollback plan. This packet is not bureaucracy. It is the evidence that the machine has earned a larger boundary. Without it, autonomy expands by enthusiasm. With it, autonomy expands by proof. The difference becomes visible the first time the system encounters an edge case no demo included.

Share