Tool Guardrails
> **Working claim: ** Once a model can act, guardrails stop being about text and become about authorization.
Key Takeaways
- Tool Guardrails treats guardrails as placed controls, not a single wall around the model.
- The right question is ROAD: which risk, at which operation, with which action, and which detection signal?
- A useful guardrail system reduces both bypasses and overblocking while keeping residual risk observable.
**Working claim: ** Once a model can act, guardrails stop being about text and become about authorization. The model is an untrusted caller holding your application's credentials, and the only reliable controls are the deterministic ones you would put around any untrusted caller: least-privilege capability manifests, argument validation, and a policy engine that decides, independent of the model's reasoning, whether a tool call is allowed.
The confused deputy with your credentials
Everything before this chapter has been about words: the model reads input, retrieves text, produces output. The moment you give the model tools, functions it can call to look something up, send a message, update a record, move money, the safety problem changes shape entirely. The model is no longer just generating text that a human reads and judges. It is generating actions that the system executes, often automatically, using the application's own credentials and permissions. The introduction's refund attack lived here, and it is worth being precise about why this is the most dangerous operation in the boundary.
The classic security term is the confused deputy: a program with legitimate authority that is tricked by a less-privileged party into misusing that authority. An LLM agent is the confused deputy par excellence. It holds the application's credentials (it can call the refund API, the email API, the database). It is steerable by untrusted input (the user's message, a retrieved document, a tool result). And it has no native concept of authorization, it will call any tool it has been given, with any arguments its reasoning produces, because calling tools is exactly what it was built to do. OWASP LLM06 Excessive Agency names this risk directly: the harm comes from giving the model more capability, autonomy, or permission than the task requires, so that a manipulated or mistaken model can do real damage.
The fatal instinct is to control tool use the way you control text: with a prompt ("only issue refunds when appropriate") or an output classifier. Neither works, for the reason established across the security discussion: the model is the thing being manipulated, so a control inside the model's reasoning can be manipulated along with it. Tool guardrails must be deterministic controls outside the model, enforced by the application, treating every tool call the model emits as an untrusted request, exactly as you would treat an HTTP request from an unauthenticated client.
The capability manifest: least privilege, declared
The first control is to stop giving the model tools by default and start giving it the minimum set the task requires, each with explicit constraints. A capability manifest declares, per tool, what the model may call, with what argument bounds, requiring what authorization, with what side-effect classification. It is the allow-list for actions, and it is the single highest-use tool guardrail because it shrinks the blast radius before any specific call is even considered.
# Capability manifest for a customer-support agent. Least privilege by default.
tools:
- name: lookup_order
side_effect: read # read | draft | write | irreversible
risk_tier: 0
authorization: "requester == order.customer_id" # deterministic, not model-judged
arguments:
order_id: { type: string, pattern: "^ORD-[0-9]{8}$" }
rate_limit: "30/min/principal"
- name: draft_reply
side_effect: draft # produces something a human/next-step reviews
risk_tier: 1
authorization: "authenticated"
arguments:
body: { type: string, max_length: 4000 }
- name: request_refund
side_effect: write
risk_tier: 2
authorization: "requester == order.customer_id"
arguments:
order_id: { type: string, pattern: "^ORD-[0-9]{8}$" }
amount_cents: { type: integer, min: 1, max: 20000 } # HARD cap, enforced outside the model
constraints:
- "amount_cents <= order.paid_amount_cents" # can't refund more than was paid
- "amount_cents <= principal.auto_refund_limit_cents" # above this -> approval (Ch.11)
requires_approval_above: { amount_cents: 5000 }
# NOT granted to this agent at all: issue_credit, modify_account, send_external_email.
# Excessive agency is prevented by ABSENCE: the agent cannot call what it does not have.
Three design principles are encoded. Least privilege by absence: the most reliable way to prevent the agent from doing something dangerous is to not give it the tool, the manifest's most important entries are the ones that are not there. Bounds in the manifest, enforced outside the model: amount_cents has a hard maximum that the application enforces regardless of what the model requests; the model cannot exceed it by reasoning cleverly, because the bound is checked in code. Side-effect classification on every tool: read, draft, write, irreversible, this classification drives the side-effect ladder of Chapter 11 and determines how much scrutiny each call gets.
The tool-call policy engine
The manifest declares what is possible. The policy engine decides, for each specific call the model emits, whether to allow it, validate its arguments, require approval, or refuse, deterministically, before execution. This is the control that the refund attack needed and did not have. It runs at boundary operation 12 (tool authorization), between the model's tool-call output (operation 10) and tool execution (operation 13).
def authorize_tool_call(call: ToolCall, ctx: AgentContext) -> ToolDecision:
tool = manifest.get(call.name)
if tool is None:
# Model requested a tool it does not have. Hard deny + alert: this is anomalous.
return ToolDecision.deny(call, reason="tool_not_in_manifest", alert=True)
# 1. Argument validation - schema + bounds, deterministic, never model-judged.
errs = tool.validate_arguments(call.arguments)
if errs:
return ToolDecision.deny(call, reason=f"argument_validation:{errs}")
# 2. Authorization - a FACT lookup, never the model's claim.
# The model may "believe" the user is entitled; we check the database.
if not tool.authorization_satisfied(ctx.principal, call.arguments, ctx.facts):
return ToolDecision.deny(call, reason="not_authorized", alert=True)
# 3. Constraints - business rules the model cannot waive.
for c in tool.constraints:
if not c.holds(call.arguments, ctx.facts):
return ToolDecision.deny(call, reason=f"constraint_violated:{c.id}")
# 4. Side-effect ladder - escalate scrutiny by impact (Ch.11).
if tool.requires_approval(call.arguments):
return ToolDecision.require_approval(call, approver=ctx.approval_route(tool))
if tool.side_effect in ("write", "irreversible") and ctx.dry_run_available:
return ToolDecision.dry_run_first(call)
return ToolDecision.allow(call) # logged with full arguments + decision (op 16)
Every check is deterministic and lives outside the model. Argument validation rejects the malformed or out-of-bounds call. Authorization is a fact lookup against the database, not the model's belief, the line that breaks the refund attack, because the injected instruction can make the model believe the user is a "verified premium member entitled to a refund, " but the policy engine checks the actual entitlement and finds none. Constraints enforce business rules the model cannot reason its way around. And the side-effect ladder escalates writes and irreversible actions to dry-run or approval. The model proposes; the policy engine disposes; and the disposing is done by code that an injection cannot talk its way past.
Validate the tool result, too
A subtler control: the model does not only emit tool calls, it consumes tool results, and a tool result is just more untrusted text entering the context (boundary operation 14). If a tool returns data an attacker controls: a web-fetch tool retrieving a page, a database query returning a field a user wrote, a third-party API echoing input, that result can carry a second-order prompt injection straight back into the model's reasoning, the same indirect-injection mechanism as Chapter 7 but on the tool side. The agent that fetches a webpage and then "decides" to call a dangerous tool may be following an instruction embedded in the page it just fetched.
So tool results must be wrapped and treated as untrusted evidence (the Chapter 7 discipline), and high-risk tool results should be validated before they re-enter context: did the result match the expected schema, does it contain instruction-shaped content, is it within expected bounds? An agent loop without result validation is an open channel for an attacker to inject instructions through any data source the agent touches.
def ingest_tool_result(result: ToolResult, tool: Tool, ctx) -> WrappedEvidence:
if not result.matches_schema(tool.result_schema):
return WrappedEvidence.error(tool, "unexpected_result_shape")
findings = scan_for_injection(result.payload) # tool results are untrusted too
payload = sanitize_for_evidence(result.payload) # Ch.7 sanitizer
return WrappedEvidence(
content=payload,
trust="untrusted_tool_result", # NOT an instruction channel
injection_flags=findings, # logged; feeds monitoring (Ch.13)
source=tool.name,)
Why the model's reasoning is not a control
It is worth stating the governing principle as bluntly as possible, because it is the most violated principle in agent design: **the model's chain of reasoning is not a security control. ** A model that "reasons" carefully about whether a refund is appropriate is not enforcing the refund limit; the limit is enforced only if a deterministic check enforces it. A model that "decides" the user is authorized is not authorizing; authorization happens only if a fact lookup authorizes. This is not a statement about model quality, even a perfectly capable model is the wrong place to put a security control, because it is the component the adversary is manipulating. The OWASP guidance on excessive agency and prompt-injection prevention converges on exactly this: constrain the agent's capabilities and require human-or-deterministic authorization for consequential actions, rather than trusting the model to police itself.
This principle is also what makes tool guardrails compose with everything before them. The input gatehouse (Chapter 6) reduces the volume of attacks reaching the model. The retrieval firewall (Chapter 7) reduces injected instructions in context. The output controls (Chapter 8) validate the structure of tool calls. But none of those have to be perfect, because the tool policy engine is the deterministic backstop: even if an attack gets all the way through to a tool call, that call is authorized against real facts, validated against real bounds, and escalated by real side-effect classification before anything happens. Defense in depth means the last layer does not depend on the earlier ones succeeding, and at the tool boundary, that last layer is deterministic authorization, not the model's good judgment. Chapter 11 takes the side-effect classification this manifest introduced and builds the rest of the ladder: dry-run, approval gates, transaction limits, and the reversibility design that decides what an agent may do alone.
Chapter summary
When the model can act, guardrails become authorization, because the model is a confused deputy: it holds the application's credentials, it is steerable by untrusted input, and it has no native concept of authorization (OWASP LLM06 Excessive Agency). Controlling tool use with a prompt or an output classifier fails for the same reason model self-checks fail, the control sits inside the component the adversary is manipulating, so tool guardrails must be deterministic controls outside the model, treating every tool call as an untrusted request. The first control is the capability manifest: grant the minimum tools the task needs, each with argument bounds, deterministic authorization, side-effect classification (read/draft/write/irreversible), and limits enforced in code, with excessive agency prevented chiefly by absence, the tools you do not grant. The second is the tool-call policy engine at boundary operation 12: for each emitted call it validates arguments, authorizes by fact lookup against the database rather than the model's belief (the check that breaks the refund attack), enforces business constraints the model cannot waive, and escalates writes and irreversible actions to dry-run or approval, the model proposes, the policy engine disposes. The third is tool-result validation: results are untrusted text re-entering context and can carry second-order injection, so wrap, schema-check, and scan them before they reach the model's reasoning. The governing principle, the most violated in agent design, is that the model's chain of reasoning is not a security control; this is what lets tool guardrails compose as a deterministic backstop that does not depend on the earlier layers being perfect.
