Alpesh Nakrani · Blog

Principles of Building AI Agents That Hold in Production

Alpesh Nakrani — Wed, 17 Jun 2026 18:30:00 GMT

The principles of building AI agents do not live in any framework. They sit underneath all of them: bound the autonomy, name what you will never delegate, evaluate continuously, and design memory that is honest about what it knows. Learn those four and you can pick up LangGraph, CrewAI, or the OpenAI Agents SDK in an afternoon. Skip them and the framework will not save you.

I have watched this play out from both seats. As an engineer I have traced agent runs at 3am to find out why a "working" system did something nobody asked for. As an operator I have signed off on the budget for those systems and answered for them when they failed. The frameworks change every quarter. The reasons agents break in production have not changed at all.

Here is the number that should anchor every design decision. If each step in an agent's loop succeeds with probability p, an n-step task succeeds end to end with probability p^n. At 95% per-step accuracy, a 20-step task completes only about 35% of the time. The compounding math is unforgiving, and it explains why Gartner predicts more than 40% of agentic AI projects will be canceled by the end of 2027, citing escalating costs, unclear business value, and inadequate risk controls. The principles below are how you fight the exponent. This guide expands on the narrow-band argument in my pillar field guide to AI agents and agentic workflows.

The frameworks change every quarter. The reasons agents break in production have not changed at all.

Key takeaways

If you read nothing else, read these.

Reliability compounds downward. A 95% per-step agent completes a 20-step task only about 35% of the time. Short loops beat long ones, every time.
Bound autonomy first. Give the agent the smallest scope that does the job, then widen it only when your evals earn the right.
Name what you never delegate, and enforce it in the tool layer. Prompts are advisory and models are persuadable; a guardrail that does not exist in the tool set cannot be talked around.
"A human reviews it" is not an evaluation plan. Score the trajectory, not just the final answer, and run the harness on live traffic forever.
Memory must be honest about its own freshness. An agent that cannot say "I do not know" will confidently act on stale or invented facts.

Bound the autonomy before you extend it

The first principle: give an agent the smallest possible scope that still does the job, then widen it only when your evals earn the right. Autonomy is not a virtue. It is a liability you take on deliberately, in exchange for a specific payoff.

The reasoning is the compounding math again. Every additional step the agent decides on its own is another factor of p in p^n. A bounded task with a clear start and stop state keeps n small. "Triage these 200 tickets and draft a response queue" is bounded. "Run our support function" is not. Bounding the task is the single cheapest way to raise reliability, and most teams skip it because the demo looked impressive at full scope.

Concretely, bounding means three things: a defined input and output, a tool set scoped to least privilege, and a step ceiling that stops the loop instead of letting it wander. An agent that drafts emails does not get a send key. An agent that reads a database connects to a read replica, not the primary.

A founder I advised, Maya, wanted an agent to run onboarding for her SaaS end to end: read each signup's setup, configure the workspace, email the customer, and book a call. It demoed beautifully and broke in week two, when it misread a plan field, configured the wrong billing tier, and emailed a customer a quote that was off by a factor of three. We cut the scope to one bounded, reversible job: draft the onboarding email for a human to send. That version has run for four months without an incident. The whole task looked like an agent; the part that earned its keep was a sliver of it.

Failure mode when ignored: the agent runs longer than you expected, takes an action you cannot undo, and reports success anyway because it has no mechanism to surface its own error. By the time you notice, the blast radius is whatever you handed it. The OWASP Top 10 for Agentic Applications, released in late 2025, names this surface directly under goal hijack and memory poisoning. Bounded scope is what keeps the worst case small. If you want a team that scopes agents this way from day one, you can hire AI engineers who have shipped them in production.

Name what you will never delegate

The second principle is the one that ties directly to revenue, and it is the one vendors never write about. Decide, in advance and in writing, which decisions a human always makes. Then enforce it in the tool layer, not in the prompt.

This is the both-seats principle. From the engineering seat, a human checkpoint is a control surface. From the revenue seat, it is an accountability boundary. When an agent sends a discount, signs a contract clause, or emails a customer, someone owns the consequence. If you cannot name who owns it, the agent should not be allowed to do it. At Devlyn we run this as a standing rule: a senior person is always in the loop for anything that touches a customer relationship or a material financial outcome. The agent triages; the human decides.

The reason to enforce this in the tool layer is blunt: prompts are advisory and models are persuadable. Cisco's security team argues that prompt injection is the new SQL injection, and guardrails are not enough, because a model processes system instructions, user input, and retrieved context as one continuous stream of tokens with no architectural way to tell trusted commands from untrusted data. A guardrail written into the system prompt can be talked around. A guardrail that simply does not exist in the agent's tool set cannot.

If you cannot name who owns the consequence of an action, the agent should not be allowed to take it.

Picture the difference in one team. A fintech group I worked with had an agent that could issue account credits up to a cap, gated only by a system-prompt rule that said "never exceed $50 without approval." A crafted support message walked it past the rule twice in a week. We moved the cap into the tool itself: the credit endpoint refused any amount over the threshold and returned a flag for human approval instead. The same prompt-injection attempts hit the new design and did nothing, because the dangerous action no longer existed in the agent's reach.

Failure mode when ignored: the agent does something irreversible and expensive, and your post-mortem discovers there was never a clear answer to "who was supposed to approve this?" That is not a model failure. That is a design failure, and it lands on the P&L. The narrow band where autonomy earns its keep is the subject of my honest accounting of what agents can do today and the book it grew into, Agents That Actually Work. The discipline of naming undelegated decisions is the whole argument of my book Human in the Loop Is Not a Plan.

Evaluate continuously, not at launch

The third principle: "a human reviews it" is not an evaluation plan. Build a harness that scores the agent on the failure modes that actually cost you, and run it before every change and on a sample of live traffic forever.

The evidence here is stark, and it comes from production rather than the lab. One March 2026 reliability report analyzed more than 4.4 million tests across 6,259 deployed agents in ten regions and found an aggregate success rate of 56.6%, far below the near-perfect numbers those same agents post on benchmarks. The agents did not get worse in the field. The evaluation was measuring the wrong thing. A lab benchmark rewards average-case behavior on clean inputs; production punishes the tail, where requests are ambiguous and APIs are flaky.

A useful agent eval looks different from a single-prompt eval. You score the whole trajectory, not just the final answer. Did the agent call the right tools in a sensible order? Did it escalate when its confidence was low? Did it stop when it should have stopped? You also track behavioral drift over time, because a model update upstream can change a behavior you depended on without anyone announcing it. This is where evaluation and the human checkpoint meet: a sampled human review is not the whole plan, but it is the labeled data that keeps the automated harness honest.

# A graded agent run, scored on the trajectory, not just the answer

RUN ticket-triage agent=v4 model=mid-tier steps=5

step 1 tool=search_kb ok latency=410ms

step 2 tool=classify ok conf=0.88

step 3 tool=check_refund ok matched=true

step 4 action=flag_for_human ok # correct: never auto-decide refunds

step 5 action=stop ok

EVAL trajectory="pass" cost=$0.009 p95_latency=2.1s human_escalation=correct

Failure mode when ignored: the suite passes every check right up until launch, and a customer finds the regression for you. Anthropic's own guidance on building effective agents leans on transparency for this reason: show the agent's planning steps so a human can inspect them. You cannot evaluate reasoning you cannot see. If you would rather have observability and evals built in from the start than bolted on after an incident, that is the work Devlyn does on AI observability and monitoring.

Design memory that is honest about what it knows

The fourth principle is the most underrated: an agent's memory should be honest about its own confidence and freshness, or it will confidently act on stale or invented facts. Memory is where reliable-looking agents quietly go wrong.

Working memory lives in the context window and is ephemeral. Episodic memory is the structured log of what happened in past runs. Semantic memory holds the facts and preferences that should persist. The honesty problem shows up at the boundaries. An agent told to escalate issues open more than 48 hours fails to escalate because it reads the timestamp inconsistently. An agent told not to double-message a customer cannot check reliably because the sent-message log lives in another system with a timezone bug. These are not exotic edge cases. They fail in the first week.

Honest memory means every retrieved fact carries provenance and freshness, and the agent is built to say "I do not know" instead of filling the gap. Anthropic treats context engineering in 2026 as a first-class design problem rather than a prompt afterthought, warning that context windows of every size are subject to pollution and relevance decay. The deeper architecture for it is the subject of my guide to memory systems for agents, and the retrieval side, where memory meets RAG, is where Devlyn's work on RAG and knowledge integration earns its keep.

Failure mode when ignored: the agent retrieves a stale record, treats it as current, and takes an action that was correct last week and wrong today. Because the memory had no concept of its own staleness, nothing flagged it. The agent looked reliable right up until the moment it was not.

A note on frameworks

None of these principles tell you which framework to use, and that is the point. Anthropic's advice to keep agents simple and prefer composable patterns over heavy abstraction holds because the framework is the easy part. The hard part is the four constraints above, and they are framework-agnostic. Start with the smallest thing that works, add abstraction only when the system demands it, and let your evals decide when to widen scope.

There is a real trade-off here worth naming. Bounding autonomy and enforcing human checkpoints makes agents slower and less impressive in a demo. You give up the long, autonomous run that wows a stakeholder. What you get back is a system that holds at 3am and does not put your name on a mistake you never approved. In production, that trade is not close.

Frequently asked questions

What are the core principles of building AI agents?

Four durable principles sit beneath every framework: bound the autonomy to the smallest scope that does the job, name in writing what a human will never delegate and enforce it in the tool layer, evaluate continuously on real failure modes rather than only at launch, and design memory that is honest about its own freshness and confidence. Frameworks come and go; these constraints decide whether the agent holds in production.

Why do AI agents fail in production?

Mostly because of compounding error. If each step succeeds with probability p, an n-step task succeeds with probability p^n, so a 95% per-step agent completes a 20-step task only about 35% of the time. Production reliability data backs this up: one large 2026 study of deployed agents measured an aggregate success rate near 57%, well below their benchmark scores, because evals measured average-case behavior while production punishes the tail.

How much autonomy should an AI agent have?

As little as the job requires. Autonomy is a liability you take on deliberately in exchange for a specific payoff, not a default. Start bounded, with a defined input and output, least-privilege tools, and a step ceiling. Widen scope only when your evaluation harness shows the agent is reliable at the current scope.

How do you stop prompt injection from breaking an AI agent?

You cannot fully patch it at the prompt layer, because a model reads trusted instructions and untrusted data as the same stream of tokens. The durable defense is architectural: do not give the agent a tool that can take a dangerous action in the first place. Cap risky operations inside the tool itself, require human approval for anything irreversible, and treat every guardrail you can only express in the system prompt as advisory rather than enforced.

What is the best framework for building AI agents?

The wrong question to lead with. Anthropic and most production teams now recommend starting with simple, composable patterns and adding framework abstraction only when the system demands it. The framework is the easy part; bounding autonomy, enforcing human checkpoints, evaluating continuously, and designing honest memory are what actually determine reliability.

If you are turning these principles into a system that has to hold under real load, that is the work my team does every day. See how a Devlyn AI engineering team ships agents with bounded scope, named checkpoints, and evals built in from day one. The principles are free. Putting them into production reliably is the job.

How to Build an AI Agent (the Loop That Holds)

Alpesh Nakrani — Tue, 16 Jun 2026 18:30:00 GMT

To build an AI agent, build a loop. Here is the recipe in five moves: spec the task and its success criteria, give the model a small set of bounded tools, enforce guardrails in code, wire evals against real traces, then ship it behind a human gate. That is the whole thing. Everything else is a framework choice, and the framework is the easy part.

I am going to walk through how to build AI agents the way I actually ship them, from both seats. As an engineer I have traced agent runs at 3am to find out why a "working" system spent forty dollars looping on a malformed tool call. As an operator I have signed the budget for those systems and answered for them when they sent something a customer was not supposed to see. The reflexes below are what separate a demo that wows a stakeholder from a system that holds. This is the build-level companion to my field guide to agentic workflows, which makes the case for when an agent is the right tool at all.

An agent is a loop: plan, call a tool, read the result, decide whether to stop. The work is making that loop safe to leave running.

The reason to take the loop seriously is arithmetic. If each step succeeds with probability p, an n-step task succeeds end to end with probability p^n. At 95% per-step accuracy a 20-step task completes only about a third of the time, and the compounding math is unforgiving: five steps land around 77%, twenty steps around 35%, fifty steps around 7%. Most of this guide is about fighting that exponent.

Key takeaways for building AI agents

If you read nothing else, read these.

An agent is a loop, not a framework. Plan, call a tool, read the result, decide whether to stop. You can write it by hand in under a hundred lines, and doing so once shows you what every framework is doing for you.
The spec is the program. If you cannot state the success criterion as something you could test, you are not ready to build. Input, output, allowed tools, stop condition.
Guardrails belong in code, not the prompt. A prompt is advisory and a model is persuadable. Step ceilings, spend budgets, and tool permissions live in the code that wraps the model call.
You score the trajectory, not just the answer. Build the eval harness from real traces and run it on every release, because an upstream model update can shift a behavior you depended on.
Reliability compounds downward. Bound the task so the step count stays small, gate the irreversible actions behind a human, and widen scope only when evals earn it.

If you are weighing whether a process should be an agent at all before you build it, the narrow-band framework in my book Agents That Actually Work is the faster read. The rest of this piece assumes you have decided to build.

Start with the spec, not the framework

Step one is not picking LangGraph or the OpenAI Agents SDK. Step one is writing down, in plain language, what the agent does and how you will know it worked. If you cannot state the success criterion as something you could test, you are not ready to build the agent.

A usable spec has four parts: the input the agent receives, the output it must produce, the tools it is allowed to touch, and the explicit stop condition. "Triage inbound support tickets and draft a response for each, escalating anything about billing" is a spec. "Handle support" is a wish. The first one tells you exactly what to evaluate. The second one guarantees you will argue about whether it worked.

This is the same move I make everywhere, because the spec is the source of truth for any AI-Native system, an idea I argue in my AI-Native thesis. The spec is what your evals score against later. Write it first and the rest of the build has a target. Skip it and you are tuning a system with no definition of done, the failure I keep coming back to in the principles of building AI agents.

Build the loop, see the trace

An agent is an LLM calling tools in a loop until it reaches a goal or hits a budget. Anthropic's definition is deliberately plain, and I keep it plain on purpose. Before you reach for a framework, write the loop once by hand so you can see every turn. Here is the shape of it, with the trace it should print.

# the loop, stripped to its bones

while step < MAX_STEPS and not done:

reply = model.respond(messages, tools=ALLOWED)

if reply.tool_call:

result = run_tool(reply.tool_call) # validated, idempotent

messages.append(result)

else:

done = True

step += 1

Now run it and watch the trace, because the trace is the product. A good agent log is legible at a glance and tells you what the model decided and why.

# one agent run, abbreviated trace

[1] plan "classify ticket, then draft reply"

[2] tool get_ticket(id=8842) -> "category: billing"

[3] guard billing -> ESCALATE (no auto-reply on billing)

[4] stop handed to human queue, step=3/12

The honest part: most of what you build is not the model call. It is the plumbing around line three of that trace. The validation, the budgets, the escalation rule, and the logging are the agent. The model.respond call is one line. Teams who think the model is the hard part ship demos. Teams who think the loop is the hard part ship systems.

Give it bounded tools, not a toolbox

Tools are how an agent acts, and the cheapest reliability win is to give it fewer of them. Selection quality drops as the option count climbs, so a tight, well-documented tool set beats a sprawling one every time. Each tool is an API you own: validate its inputs and outputs, make its side effects idempotent, and put a time and cost budget on it.

Bounding tools is also where engineering meets the P&L. An agent that drafts emails does not get a send key. An agent that reads customer data connects to a read replica, not the primary. The point is not theoretical tidiness. It is that the blast radius of a bad decision is exactly the set of tools you handed over. Scope the tools and you scope the worst case.

The protocol layer has settled enough to lean on. The Model Context Protocol gives you one way to expose tools that most agents can consume, so you implement an integration once instead of per framework. That is a real convenience. It does not change the rule: every tool the agent can reach is a decision you have delegated, and you should be able to name why. The deeper patterns for this live in the agentic design patterns that hold in production.

Put guardrails in code, not in the prompt

Here is the principle I will not bend on: enforce guardrails in code, not in the system prompt. A prompt is advisory, and a model is persuadable. A guardrail that lives in the prompt can be talked around. A guardrail that does not exist in the tool layer cannot.

Concretely, that means three brakes on the loop. A step ceiling that stops the agent instead of letting it wander. A spend budget that halts the run when token or tool cost crosses a line. And hard exit conditions, including a self-correction step where a failed tool result is revised mid-run rather than passed downstream to poison the next decision. None of these live in the prompt. They live in the code that wraps the model call.

A guardrail in the prompt can be talked around. A guardrail that does not exist in the tool layer cannot.

Failure mode when you skip this: the agent treats a malicious instruction buried in retrieved data as a real command, because an LLM has no built-in way to tell trusted instructions from untrusted content when both arrive as the same tokens. That is why prompt injection is now treated as an architectural constraint, not a bug you patch. The defense is structural. The agent can only do what its tools permit, and the dangerous tools are not in the set.

Wire evals from real traces, run them on every release

"A human reviews it" is not an evaluation plan. Build a harness that scores the agent on the failure modes that actually cost you, derive the cases from real traces, and run it on every release so a regression fails the build before a customer finds it.

Agent evals look different from single-prompt evals because you score the whole trajectory, not just the final answer. Did the agent call the right tools in a sensible order? Did it escalate when its confidence was low? Did it stop when it should have stopped? Anthropic's guidance on agent evals makes the same point: good evals make behavioral changes visible before they reach users, which matters because a model update upstream can shift a behavior you depended on without anyone announcing it.

The thing that should scare you into doing this is the gap between lab and production. An agent can clear a benchmark and still fail on real traffic, because the benchmark measures average-case behavior while production punishes the tail. The agent did not get worse; you simply never measured the cases that break it. The fix is an eval suite built from your own traffic, which is exactly the discipline behind evals for agents that predict production. For the framing under it, my honest accounting of what agents can do today and the book it became, Agents That Actually Work, both start from the same premise: evaluation is the job.

This is also the point where most teams discover they wired evals too late. If you would rather have the harness and the observability built in from day one than reconstructed after an incident, that is the work a Devlyn AI engineering team does, with the observability and monitoring sitting under the agent from the first deploy rather than bolted on after the first customer complaint.

Ship behind a human gate

The last step is the one that decides whether you sleep. Ship the agent behind a human gate at the decisions you will never delegate, and enforce the gate in the tool layer where it cannot be skipped.

Modern frameworks make this concrete. The OpenAI Agents SDK lets a tool declare that it needs approval, which pauses the run, surfaces the pending action, and resumes only after a human approves or rejects it. The agent triages; the human decides. That is the thesis in one mechanism: the machine does the work, and the human evaluates.

From the revenue seat this gate is an accountability boundary, not a nicety. When an agent sends a discount, edits a contract, or emails a customer, someone owns the consequence. If you cannot name who owns it, the agent should not be allowed to do it. The trade-off is real and worth saying plainly: a human gate makes the agent slower and less impressive in a demo. You give up the long autonomous run that wows a stakeholder. In exchange you get a system that does not put your name on a mistake you never approved. In production that trade is not close.

This is also why I am skeptical of starting with elaborate multi-agent swarms. A narrow, observable workflow that succeeds 99% of the time and escalates the rest beats a clever multi-agent system that collapses in week two. Bound it, gate it, then widen scope only when your evals earn it. The durable version of that argument is in the principles of building reliable AI agents.

Where this breaks

Being honest about the failure surface is part of the build. Three places break most often, in my experience.

Memory breaks first. One support agent I reviewed was told to escalate any ticket open more than 48 hours, and it quietly missed a third of them because the sent-message log it read lived in a second system with a timezone bug. Stale facts read as current facts, and the agent acted on them with full confidence. Second, tool sprawl: every tool you add lowers selection accuracy and widens the blast radius. A team I advised had an agent that handled five tasks at 96% reliability, then watched it slide toward the low 80s once they pushed it to fifteen, because the model now chose wrong among too many lookalike tools. Third, the silent regression, where an upstream model update changes a behavior and nothing flags it until a customer does. That last one cost one team a week of wrong refund quotes before anyone noticed, which is the entire argument for continuous evals over a one-time launch check.

None of these are exotic. They show up in the first week of real traffic, not the edge of the distribution. If your framework picks, that is fine; the honest comparison of agentic AI frameworks from production walks through what each one costs you here.

Frequently asked questions

How do you build an AI agent step by step?

Build it as a loop in five moves. First, write a spec with a testable success criterion: input, output, allowed tools, and stop condition. Second, build the plan-act-observe loop and make the trace legible. Third, give it a small set of bounded, validated tools. Fourth, enforce guardrails in code, including step ceilings, spend budgets, and exit conditions. Fifth, ship behind a human gate at the decisions you will never delegate, with evals running on every release. The framework you use to assemble these is the least important choice.

Do I need a framework like LangGraph or the OpenAI Agents SDK to build an agent?

No. An agent is an LLM calling tools in a loop, and you can write that loop by hand in well under a hundred lines. Doing it by hand once is the fastest way to understand what the framework is doing for you. Reach for a framework when you need its specific features, such as built-in human-in-the-loop approvals, handoffs between specialist agents, or streaming. Pick the abstraction the system demands, not the one with the best landing page.

Why do AI agents fail in production?

Mostly compounding error. If each step succeeds with probability p, an n-step task succeeds with probability p^n, so a 95% per-step agent completes a 20-step task only about a third of the time. A second cause is the gap between lab and production: an agent clears a benchmark that measures average-case behavior, then meets the tail cases the benchmark never tested. The fix is to bound the task so n stays small, then evaluate continuously on your own real traffic.

How do you keep an AI agent safe to run unattended?

Enforce limits in code, not in the prompt. Give the agent the smallest tool set that does the job, with no access to irreversible actions. Put a step ceiling and a spend budget on the loop. Gate any high-stakes action behind human approval at the tool layer, so it pauses and waits rather than acting. Then run evals built from real traces on every release so a behavior regression fails the build before a customer hits it.

If you are turning this loop into a system that has to hold under real load, that is the work my team does every day. See how a Devlyn AI engineering team builds AI agents with bounded tools, guardrails in code, and evals wired in from day one. The loop is free to write. Making it safe to leave running is the job.

Agentic AI Frameworks Compared (From Production)

Alpesh Nakrani — Mon, 15 Jun 2026 18:30:00 GMT

There is no single best agentic AI framework. The right one is the one whose costs you can live with: how much control it takes from you, how much observability it gives back, and how much lock-in it leaves behind. LangGraph, CrewAI, the OpenAI Agents SDK, and the no-framework option each trade those three things differently. Pick on the trade, not the feature list.

I have shipped agents into real operational workflows where a wrong action lands on a customer, not a slide. The framework choice rarely decided whether the agent worked; the bounded scope and the eval harness did that. But the framework decided how painful month three was: debugging, cost, and the day a model provider changed something under me. This is the comparison I wish I had read before I picked.

A framework does not make your agent reliable. It decides how much it costs you to find out that it is not.

Key takeaways

If you read nothing else, take these five claims with you:

LangGraph buys you control and durable, inspectable state. You pay in boilerplate and a learning curve before the first agent runs.
CrewAI buys you the fastest multi-agent prototype. You pay when role-play abstractions hide what each agent actually did.
The OpenAI Agents SDK buys you the shortest path to working code. You pay in vendor lock-in and a thinner story off OpenAI models.
No framework buys you total transparency and zero abstraction debt. You pay by hand-building retries, state, and tracing yourself.
The real tie is operational: whatever you pick, you still owe an eval harness, cost budgets, and a trace you can read at 3am.

Why "best agent framework" is the wrong question

The question buyers ask is "what is the best agent framework?" The question that predicts production pain is different: what does each one cost me when things break? Frameworks do not compete on whether they can build an agent. They all can. They compete on what they hide and what they expose.

Three costs matter, and they trade against each other. Control is how precisely you can shape the agent's next step. Observability is how clearly you can see what it did after the fact. Lock-in is how hard it is to leave when the provider, the price, or the abstraction stops serving you. A framework that maximizes one usually taxes another. That tension is the whole decision.

This connects to a point I make in my honest accounting of what agents can do today, and it sits downstream of the pillar argument in my guide to agentic workflows that hold in production: the framework is downstream of the spec. Decide what the agent must never do, what it must escalate, and how you will grade it. Then the framework choice gets small, because most of your reliability lives in code you wrote, not in the library you imported.

LangGraph: control and durable state, paid for in boilerplate

LangGraph models an agent as an explicit state machine over a graph. You define nodes, edges, and the state that flows between them. That is more code than the alternatives for a simple agent, and it is the point. When the workflow is non-linear, stateful, and has to survive failure, you want the control surface to be explicit, not implied.

The feature that earns its keep is durable execution. LangGraph checkpoints state after every node to SQLite or Postgres, so a failed run resumes from the last successful node instead of starting over. In a long workflow, that difference is hours of saved LLM compute per failure, per the LangGraph project docs. Pair it with LangSmith and every decision point is inspectable and replayable, which regulated teams need for audit.

The cost is real. Simple agents carry meaningful boilerplate, and the graph mental model takes time to learn. You also adopt the broader LangChain ecosystem, which is a benefit when you use its integrations and a tax when you fight its abstractions. When it works: complex, auditable, recoverable workflows where state and human review matter. The failure mode: reaching for the state machine when a 30-line script would have shipped today.

# LangGraph: state is explicit, so failure recovery is explicit too

graph.add_node("plan", plan_step)

graph.add_node("act", act_step)

graph.add_conditional_edges("act", needs_review, {"human": review, "done": END})

CrewAI: the fastest multi-agent prototype, paid for in opacity

CrewAI composes agents as role-driven crews: a researcher, a writer, a reviewer, each with a declarative task. When your problem maps cleanly to specialist roles, this is the fastest path to a working multi-agent system, often a few hours from zero. The abstraction matches the way people describe the work, which is why prototypes come together quickly.

That same abstraction is the cost. Role-play framing makes it easy to write a crew and hard to know what each agent actually did, what it spent, and where it went wrong. The thing that makes debugging tractable is a trace you can read, and role metaphors can sit between you and that trace. You can instrument around it, but you are adding back the observability the abstraction smoothed over.

There is also a security note worth stating plainly. CrewAI's managed and enterprise tiers carry SOC 2, but the open-source framework ships with no built-in authentication, audit logging, or access controls, so that hardening is on you before it touches regulated data. When it works: genuine multi-agent collaboration with distinct roles and a quick path to a demo. The failure mode: a "crew" of agents doing what one well-scoped agent could do, at several times the cost and latency, a trap I unpack in my pillar guide above.

OpenAI Agents SDK: shortest path to code, paid for in lock-in

The OpenAI Agents SDK is the opinionated, lightweight option. It exposes four primitives: agents, tools, handoffs, and guardrails. You can define a working multi-agent system in under 20 lines of Python, and tracing is on by default, which is a genuinely good developer-experience decision. The built-in tracing records LLM generations, tool calls, handoffs, and guardrail checks without extra wiring.

Handoffs are the headline pattern: one agent transfers control to another and passes the full message history, so the receiving agent sees the whole conversation. Guardrails wrap each interaction with input and output validation. For a team already on OpenAI models, this is the lowest-friction way to ship a structured agent.

The cost is lock-in. The SDK is tightly coupled to OpenAI models, and the story gets thinner the moment you want to route to a cheaper or different provider. That matters more than it sounds. Routing easy steps to a smaller, cheaper model is one of the largest levers on agent economics. A framework that makes provider-switching awkward quietly raises your run cost. When it works: fast prototypes and production systems committed to the OpenAI stack. The failure mode: discovering the switching cost after the architecture has hardened around one vendor.

No framework: total transparency, paid for in plumbing

The no-framework option is more credible than it sounds. An agent is a language model that calls tools and remembers context in a loop, and the core pattern is small: prompt, tool call, observe, repeat. Most frameworks are error handling, state, and tracing layered on top of that loop. You can write the loop yourself against a provider's tool-use API and keep every prompt and every decision in plain sight.

The advantage is transparency with no abstraction debt. When something breaks, you read your own code, not a library's internals. For a bounded agent with a handful of tools, this is frequently the right call, and it is what I reach for first when the scope is small and the stakes are high. The principles that make it hold are in my notes on building reliable AI agents.

The cost is that you build the plumbing yourself: retries, checkpointing, structured tracing, and concurrency. That is fine at small scope and painful at large scope, which is exactly the point where a framework starts paying for itself. When it works: bounded agents, few tools, teams that value control over speed. The failure mode: reinventing durable execution badly, six months in, when you should have adopted LangGraph.

Agentic AI frameworks compared: the trade-off table

Here is the honest comparison across the four options. Read it as costs, not scores. There is no row where one framework wins everything, and the operational row is a deliberate tie.

Dimension	LangGraph	CrewAI	OpenAI Agents SDK	No framework
Control over each step	Highest (explicit graph)	Medium (role-driven)	Medium (handoff chains)	Total (it is your code)
Time to first working agent	Slowest (boilerplate)	Fastest (hours)	Fast (under 20 lines)	Medium (you write the loop)
Observability out of the box	Strong via LangSmith	Weakest; instrument it	Tracing on by default	None; you build it
Durable / recoverable state	Yes, checkpointed per node	Limited	Limited	Only if you build it
Provider lock-in	Low (any LLM)	Low (any LLM)	High (OpenAI-coupled)	None
Operational burden (evals, cost, trace)	Still on you	Still on you	Still on you	Still on you

Notice the last row. Every framework leaves the same bill on your desk: evaluate it, budget it, and trace it yourself.

LangGraph vs CrewAI: the comparison people actually search

The LangGraph vs CrewAI question has a clean answer because the two optimize for different shapes of work. LangGraph wins when the workflow is a process: stateful, non-linear, recoverable, and audited. CrewAI wins when the workflow is a team: distinct roles collaborating, and speed to prototype matters more than step-level control.

The mistake is choosing on vibe. Teams pick CrewAI because role-based design feels intuitive, then hit a wall when they need to inspect exactly what step four did and why it cost what it cost. Other teams pick LangGraph for a linear task that never needed a state machine, and pay the boilerplate tax for nothing. Match the framework to the shape of the work, not to the demo that impressed you. If you are still weighing specific tools, my rundown of the best AI agents and the work behind them covers what separates the ones that hold.

The cost consequence, from the revenue seat

Framework choice is a P&L decision disguised as a technical one. The visible cost is engineering time. The hidden cost is the one that compounds: a vendor-locked SDK that blocks you from routing cheap steps to a cheap model, or an opaque abstraction that turns a one-hour incident into a one-day investigation. Both show up as margin, not as a line in the architecture diagram.

Here is the arithmetic that matters. An agent that resolves a task for $0.04 can beat a human on unit economics, while the same agent locked to a premium model because the framework made switching hard might cost $0.40 and erase the margin. The framework did not write that check directly; it made the cheaper path harder, which is the same thing as slower. Choose for the cost structure you can sustain, not the prototype you can ship Friday.

Seeing that margin in the first place takes instrumentation most frameworks leave to you, which is the work Devlyn does on AI observability and monitoring: per-step cost, latency, and a trace you can actually read during an incident. Without it, the lock-in tax stays invisible until it shows up in the quarter.

Frequently asked questions

What is the best agentic AI framework?

There is no single best agentic AI framework; the right one depends on what you can afford to give up in control, observability, and lock-in. LangGraph suits complex, auditable, recoverable workflows. CrewAI suits fast multi-agent prototypes with clear roles. The OpenAI Agents SDK suits teams committed to OpenAI models. For a small, bounded agent, no framework is often the cleanest choice.

LangGraph vs CrewAI: which should I use?

Use LangGraph when the work is a process: stateful, non-linear, recoverable, and audited, where step-level control and durable state pay off. Use CrewAI when the work is a team of specialist roles and you want the fastest path to a working prototype. LangGraph costs you boilerplate; CrewAI costs you observability into what each agent actually did. Match the framework to the shape of the work.

Do I even need an agent framework?

Often, no. An agent is a model calling tools in a loop, and for a bounded agent with a few tools you can write that loop yourself and keep every prompt visible. A framework pays for itself once you need durable execution, complex state, or built-in tracing at scale. Below that threshold, no framework gives you more transparency and less abstraction debt.

Which agent framework is best for production?

For stateful, auditable production workflows, LangGraph is the most production-ready, with durable checkpointing and replayable traces via LangSmith. But production-readiness is mostly not the framework. Whatever you pick, you still owe an eval harness that grades the agent's trajectory, explicit cost and latency budgets, and a trace you can read during an incident. The framework decides how hard those are to add, not whether you need them.

Where this leaves you

Pick an agentic AI framework by its costs, not its feature list. LangGraph trades boilerplate for control and durable state; CrewAI trades observability for prototype speed; the OpenAI Agents SDK trades lock-in for the shortest path to code; no framework trades plumbing for total transparency. None of them give you reliability; that comes from a bounded spec and an eval harness you build either way.

If you want the deeper framework for deciding what should be an agent at all, my book Agents That Actually Work walks the narrow-band approach with production examples. And if you want a team that ships agentic systems with evals and cost discipline built in from day one, see how Devlyn approaches hiring AI engineers who have done this in production. The honest path is also the cheaper one: choose for the cost you can sustain, prove it with evals, and switch when the numbers say you should.

Agentic AI Examples: What's Genuinely Shipping

Alpesh Nakrani — Sun, 14 Jun 2026 18:30:00 GMT

The agentic AI genuinely shipping in 2026 clusters in four categories: coding agents, research and browser agents, customer-ops agents, and data and ops automation. These are not demos. They are systems running against real users, with real numbers attached, and a few public failures that cost money. Below are concrete, dated agentic AI examples, including the ones that broke and why.

I have spent two years putting agents into production from the seat where engineering meets revenue. So I care less about what an agent could do in a keynote and more about what it does at 3am against messy input. The pattern is consistent. Agents earn their keep in a narrow band, then the human moves to the end of the loop to evaluate the output. That is the thesis, and the real-world agentic AI cases below show exactly where it holds and where it snaps.

An agentic AI example is only useful if it comes with a date, a number, and an honest account of where it failed.

Key takeaways

If you read nothing else, read these.

Coding agents are the most mature category. Frontier coding agents cluster near the top of SWE-bench Verified, which has saturated to the point of needing harder successors like SWE-bench Pro, where scores fall sharply.
Customer-ops agents work, then over-reach. Klarna's assistant handled two-thirds of chats in month one, then the company hired humans back in 2025 after cutting too far.
Browser and computer-use agents still trail humans badly. OpenAI's computer-use agent scored about 38% on OSWorld against a human baseline near 72%.
A wrong answer is a contractual liability. A tribunal held Air Canada responsible for a refund its chatbot invented, in February 2024.
The shipping examples share one trait. The task is bounded, the failure is catchable, and a human evaluates the output before it reaches a customer.

Coding agents: the most mature agentic AI examples in production

Coding agents are the clearest real-world agentic AI in production today. They read a repository, plan a multi-file change, run the tests, and iterate on failures. This is genuine agentic behavior: a loop of action, observation, and correction, not single-shot generation.

The numbers got serious fast. By mid-2026, the leading coding agents from Anthropic, OpenAI, and others sit clustered near the top of SWE-bench Verified, the standard issue-resolution benchmark, with several frontier models statistically tied around 80-90%. That clustering is the story. When six models from four labs land within a few points of each other, the benchmark has stopped discriminating at the frontier, which is why harder successors like SWE-bench Pro now exist. On those tougher sets, scores fall sharply. That gap between benchmarks is itself the honest signal: the easy issues are solved, the messy ones are not.

If you are weighing which coding agent to put in front of your repository, start from the failure cost, not the leaderboard. I walk through that lens in the pillar on AI agents and agentic workflows, and the durable framework for it lives in Agents That Actually Work.

The shipping pattern is the agent in CI. Cursor and GitHub Copilot both run an agent inside a clean VM that clones the repo, makes the change, opens a draft pull request, and fixes the build until it goes green.

# the loop a coding agent runs, end to end

clone repo --branch agent/fix-1842

run tests "pytest -q" # observe failures

edit 3 files, push draft PR #4471

iterate until CI green, then request human review

Here is the trade-off I will not pretend away. The agent closes the draft PR; a person still merges it. The harness around the model matters as much as the model. I have watched a coding agent pass every test and still ship a change that broke a downstream contract nothing in the suite covered. The agent did the work. The review caught the miss. The revenue point is direct: a merged regression in a billing path costs more than a week of the agent's compute. The whole approach to picking the best AI agents for code should start from what its failure mode costs you, not its benchmark.

Research and browser agents: fast drafts, slow trust

Research and browser agents are the second category of agentic AI examples, and the most over-sold. They search, read, click, and synthesize across the web. Deep-research modes in ChatGPT, Claude, and Perplexity will produce a sourced literature review in minutes. The draft is genuinely useful. The citations are not always trustworthy.

Browser and computer-use agents are further behind than the headlines suggest. OpenAI's computer-use agent, first shipped as Operator in January 2025 and folded into ChatGPT Agent by August 2025, scored about 38% on OSWorld and 58% on WebArena. Humans score near 72% on OSWorld. A 38% score means the agent fails roughly two of every three real desktop tasks. That is assistive, not autonomous.

A research agent drafts the literature review in minutes and still cites sources that do not say what the summary claims. You verify before you ship.

Where these agents ship successfully, the human stays at the evaluation step. The agent gathers and drafts; a person checks the claims against the cited source before anything goes to a client or a board. Treat the output as a fast first pass, not a finished answer. The cost lens applies here too: the time you save on the draft, you spend on verification, and that is still a net win for the right task.

Customer-ops agents: the Klarna example, in full

Customer-ops is where the most cited real-world agentic AI case lives, and it tells the honest story better than any vendor deck. In February 2024, Klarna reported its AI assistant handled two-thirds of customer service chats in its first month: 2.3 million conversations, the equivalent of 700 full-time agents, with resolution time dropping from 11 minutes to under 2. Klarna estimated the assistant would drive roughly $40 million in profit improvement across 2024.

Then the correction. In 2025, Klarna began hiring humans back. The CEO admitted publicly the company had cut too far, too fast on quality. By 2026 Klarna kept the agent for routine, high-volume queries, still around two-thirds of inquiries, and routed complex and sensitive cases to people. That is not a failure of the agent. It is the discovery of its band.

The failure example is sharper. In February 2024, a tribunal ordered Air Canada to honor a refund policy its support chatbot had invented for a grieving customer. The airline argued the chatbot was a separate entity; the tribunal disagreed and held the company liable for the $483 refund plus fees. When an agent states a policy, that statement is a contractual act. A hallucinated refund rule is not a quality metric. It is a legal exposure.

The lesson across both is the same one in my honest field guide to agentic workflows: customer-ops agents resolve the routine majority and must hand off the rest cleanly. The warm handoff is where the ROI actually lives, and the guardrail on what the agent may promise is what keeps you out of court. If you want a team that designs those handoffs and guardrails before launch rather than after an incident, that is the work Devlyn does when you hire AI engineers who have shipped customer-ops agents in production.

Data and ops automation: the quiet, durable agentic AI examples

The least glamorous category is the most durable. Data and ops agents triage tickets, classify and route alerts, reconcile records, and drive multi-step internal workflows. There is no demo applause here, which is exactly why they last.

The scale is real. By early 2026, Microsoft reported that more than 230,000 organizations had used Copilot Studio to build agents, with over a million custom agents created across SharePoint and Copilot Studio in a single recent quarter, most of them internal ops automation rather than customer-facing. Salesforce reported more than 29,000 Agentforce deals closed since launch, with Agentforce ARR reaching roughly $800 million. IT-support agents now handle ticket creation, classification, and status updates at meaningful autonomy. Reported cost-per-task reductions in support and code-review functions run from severalfold to far higher, though I treat any single vendor's figure as illustrative until I see it in a system I can instrument.

The shape of a durable ops agent is narrow on purpose. It owns one workflow, writes every action to a log a human can audit, and stops at a named decision instead of guessing. A ticket-triage agent that routes confidently and escalates the ambiguous case beats a clever one that resolves everything and is wrong 2% of the time.

# a bounded ops agent: act, log, escalate at the named line

classify ticket #90213 -> "billing dispute"

if confidence < 0.85: escalate to human queue

else: route, log action, await close

The honest counterweight: surveys through 2026 consistently report that the majority of agent pilots never reach sustained production, and a meaningful share of deployments never hit payback. The delta lives in edge-case diversity that only appears at real data volume. A reconciliation agent that is 98% correct sounds great until the 2% lands in a financial close. To see how the durable cases map to business constraints, I work through the matching logic in my piece on agentic AI use cases, and the bounded patterns in the full guide to agentic workflows.

The agentic AI that lasts is boring on purpose: bounded task, catchable failure, a human at the close.

What the shipping examples have in common

Across all four categories, the agentic AI in production that survives shares three traits. The task is bounded, so the agent is not asked to improvise outside its competence. The failure is catchable, by a test, a reviewer, or a guardrail, before it reaches a customer. And a human owns the evaluation step, which is where judgment scales when generation gets cheap.

The examples that became cautionary tales, Air Canada's refund and Klarna's over-cut, broke exactly one of those rules. The agent acted past its band, or no one was positioned to catch the miss. That is the difference between an agent that ships and an agent that makes the news.

FAQ

What is a real example of agentic AI?

A coding agent like Claude Code or Cursor that clones a repository, makes a multi-file change, runs the tests, and opens a pull request is a real, shipping example of agentic AI. It loops through action and correction rather than generating a single answer. Klarna's customer-service assistant and internal ops agents in Microsoft Copilot Studio are other production examples.

Which agentic AI examples have failed in production?

Two are well documented. Air Canada was held liable in February 2024 after its chatbot invented a refund policy a customer relied on. Klarna cut its support team too aggressively after early AI success, then hired humans back in 2025 for complex cases. Both failures came from letting the agent act outside a bounded, supervised task.

Is agentic AI actually in production at scale in 2026?

Yes, in specific bands. Coding agents, customer-ops triage, and internal data and ops automation run at real scale, with Microsoft citing hundreds of thousands of custom agents. But most pilots still do not reach sustained production, and browser and computer-use agents remain assistive, scoring well below humans on benchmarks like OSWorld.

How do I know if an agentic AI example will work for my case?

Check three things: is the task bounded, is the failure catchable before it reaches a customer, and is a human positioned to evaluate the output. If any answer is no, you are looking at a demo, not a production system. Match the use case to the constraint, not the hype.

If you are weighing which of these agentic AI examples maps to a real workflow in your business, that is a build-and-evaluate problem, not a slideware problem. The durable principles behind the patterns are in Agents That Actually Work, and when you are ready to ship one with observability and evaluation built in from day one, that is exactly what a Devlyn pod sets up on AI observability and monitoring. If you would rather have the build owned end to end, hire AI engineers who have shipped agents in production.

Offline vs Online LLM Evaluation: Why You Need Both

Alpesh Nakrani — Sat, 13 Jun 2026 18:30:00 GMT

Offline vs online evaluation is the difference between what you test before release and what you measure after. Offline evaluation gates a deploy: you run a candidate model or prompt against a frozen set of cases with known-good answers, and you block the ship if a metric regresses. Online evaluation measures real behavior once the change is live, through sampled scoring, A/B tests, and guardrail monitors on production traffic. You need both. Offline catches the regressions you introduce. Online catches the distribution shift offline cannot see.

I have watched teams pick one and call it a strategy. The offline-only team passes every check and then eats a quality drop they never modeled. The online-only team finds out about a bad prompt from a customer, which is the most expensive monitoring tool ever built. This piece compares the two honestly: what each measures, when each lies, and how they fit into one harness that gates a real deploy.

Offline catches the regressions you introduce. Online catches the distribution shift offline cannot see. Pick one and you are exposed on the other side.

Key takeaways

If you read nothing else, these are the load-bearing claims:

Offline evaluation is a pre-deploy gate against a frozen set. Hold everything constant except the variable under test, and block the ship on a regression.
Online evaluation measures the live distribution you cannot freeze. Sampled scoring, A/B tests, and guardrail monitors run on real traffic after release.
Offline lies about the world; online lies about cause. A clean offline run says nothing about traffic you never sampled, and an online dip rarely tells you why on its own.
Adoption is lopsided and that is the gap. Industry surveys put offline eval adoption near 52% and online near 37%, so most teams ship blind to drift.

If you are wiring this up right now, my field guide to evals lays out the offline gate and the online monitor as one system rather than two disconnected dashboards. Read on for when each half earns its keep.

What offline evaluation actually gates

Offline evaluation tests a candidate change against a fixed dataset before it reaches a user. You assemble cases with reference outputs, run the new model or prompt against them, score with metrics or a judge model, and compare to the last version. It is the unit test layer for a probabilistic system, and it does one job well: it stops a regression you introduced from shipping (Datadog, 2026).

The discipline that makes offline evaluation honest is freezing the set. You hold the dataset constant across versions so the only thing that moves is your code. The moment you edit the test to make a failing run pass, you have stopped measuring the model and started measuring your willingness to edit the test. I make the full case for the frozen ruler in my essay on evals that predict production, and the same rule governs the harness in my guide to building an LLM evaluation framework in the first place.

A clean offline run is a gate, not a guarantee. It tells you the change did not break the cases you thought to write down. That is genuinely valuable and genuinely limited. The set reflects the traffic, model behavior, and user intent that existed the day you froze it. Production does not hold still for you.

What online evaluation measures that offline cannot

Online evaluation scores real behavior after release. You sample production requests, run an automated evaluator over them, watch quality metrics over time, and trip a guardrail when something crosses a threshold. It also covers A/B tests, where you route a fraction of traffic to a new variant and compare live outcomes rather than reference answers (LangChain, 2026).

Online evaluation exists because the input distribution shifts and your frozen set does not. Three things move underneath you. Users phrase requests in ways your set never sampled. New segments arrive with intents you did not anticipate. And the model provider updates weights behind a stable version string, so behavior drifts without a single line of your code changing. Each of these passes offline and shows up online, or not at all if you are not watching.

The provider-drift case is the one that burns careful teams. Your offline set was labeled against the old behavior. The new weights are subtly different in ways no existing case triggers, but real outcomes degrade. This is the documented pattern behind several silent model updates: evals stayed green while production quality slid (LangChain, 2026). Online sampling is the only place that signal surfaces before a customer does.

Priya, a staff engineer on a support-automation team, lived this in April. Her offline suite ran green every night for three weeks: faithfulness at 0.93, no regression on the 600-case frozen set. Then escalations climbed 18% in a fortnight with no deploy of her own. The cause was a provider point release that nudged the model toward longer, hedged answers her labeled cases never tested for. Offline could not see it because nothing in her code or her set had changed. A 5% online sample caught the drift in two days; without it she would have read about it in the quarterly churn review.

When each one lies

Both methods mislead, and they mislead in opposite directions. Knowing the failure mode of each is what keeps you from over-trusting a green dashboard.

Offline lies about coverage. A passing offline suite tells you nothing about the slice of traffic you never put in the set. It is silent on novelty by construction. The set is a fixed distribution, and production is a live one, so a 0.91 offline score and a quiet production failure coexist without contradiction. Worse, offline scoring leans on judge models that carry their own noise, so a passing margin inside the noise band is not a real pass. I cover that in which eval metrics lie.

Online lies about cause. A live metric dip tells you something changed; it rarely tells you what. Did you ship a worse prompt, did traffic shift, or did the provider move the model under you? Production is probabilistic and multi-step, so isolating the root cause from a trace is hard. Online tells you the patient has a fever. It does not name the infection. You confirm the cause by reproducing the failing cases offline, which closes the loop back to your frozen set.

There is a subtler online trap worth naming. A change that looks better in an offline A/B can still hurt live outcomes, because the offline judge and the real user optimize different things. The 2026 paper "When Generic Prompt Improvements Hurt" documents exactly this: generic prompt edits that raised scores on one contract degraded downstream task success on another, in one case dropping a retrieval task from 26 of 30 passing cases to 9 (arXiv 2601.22025). The offline number went up. The thing you actually sell went down.

Dimension	Offline evaluation	Online evaluation
When it runs	Before deploy, in CI	After deploy, on live traffic
Data	Frozen set, known answers	Sampled production, no ground truth
Primary job	Gate the ship, block regressions	Detect drift and live failures
Methods	Metrics, judge scoring, regression diff	Sampled scoring, A/B tests, guardrails
Blind spot	Traffic it never sampled	Root cause of a change
Failure cost	Misses novel inputs	Reactive, user already hit it

How they fit into one harness

Treat offline and online as two stages of one feedback loop, not two competing tools. Offline gates the release. Online watches the release and feeds new cases back into the frozen set. The loop has a direction, and getting it backward is how teams end up with a test suite that quietly tracks production instead of leading it.

The mechanics are simple to state and easy to skip. Offline runs in CI and blocks merge on any regression past the judge noise band. After deploy, you sample a fixed fraction of production traffic, score it with the same evaluators, and alert when a metric crosses a guardrail. When online diverges from the offline baseline, you pull the failing production traces, label them, and add them to the next frozen version. That is the loop that keeps offline honest about the real world (Datadog, 2026).

Here is what that looks like across a single release, with illustrative numbers rather than data from a live system.

# offline gate, CI, frozen set eval-set-2026-w24-v3.jsonl

python -m eval.offline \

--suite eval-set-2026-w24-v3.jsonl \

--candidate prompt-v8 --baseline prompt-v7

faithfulness 0.92 vs 0.91 baseline # within noise, no regression

verdict PASS # gate opens, deploy proceeds

# online monitor, 5% sampled production traffic, 72h later

faithfulness 0.84 # 7 pts under offline baseline

guardrail TRIP # drift past 3pt budget

action pull 40 failing traces -> label -> add to v4

Read the gap between the two faithfulness numbers. Offline said 0.92 and the gate opened. Online sampled the live distribution and read 0.84. Nothing in the code changed between them, so the 7-point gap is the distribution shift offline could not see. The fix is not to argue with the online number. It is to capture those failing traces and promote them into the next frozen set, so the next offline run can actually catch what this one missed.

If you can only build one first

Most teams cannot stand up both halves in week one, and that is fine as long as you sequence them deliberately. Build the offline gate first. It is cheap, it runs in CI, and it stops the regressions you control, which are the most common way a release breaks. A frozen set of 50 real cases and a pass/fail threshold beat an elaborate online dashboard you have not wired yet.

Then add online sampling within the first month, before your traffic outgrows the set you froze. Start small: sample 1 to 5% of production, score it with the same evaluators your gate uses, and alert on a single guardrail metric. The point of going second is not that online matters less. It is that online without an offline baseline gives you an alarm and no way to prove a fix worked. The gate is what makes the monitor actionable, so build the thing that gives you control before the thing that gives you visibility.

The revenue line

Here is the business consequence in one frame a CRO understands. Offline-only evaluation gives you a clean pre-launch report and a quality cliff you find out about from churn. Online-only evaluation gives you a fast alarm and no way to ship a fix with confidence, because you have no gate to prove the fix worked. The first wastes the deploy you already paid for; the second turns every release into a gamble on live customers.

Run both and the math changes. The offline gate caps the cost of a bad release, because it blocks the obvious regressions for the price of a CI run. The online monitor caps the duration of a drift you could not predict, because you catch it in sampled traffic instead of in a support queue. That second number is the one that compounds. A faithfulness drift caught in 72 hours is an incident; the same drift caught in eight weeks is a renewal you lose. Continuous online scoring against a frozen reference is squarely an AI observability and monitoring job, and it is the half most teams skip.

Frequently asked questions

What is the difference between offline and online evaluation? Offline evaluation runs before deploy against a frozen dataset with known-good answers, and its job is to block regressions. Online evaluation runs after deploy on sampled live traffic, and its job is to detect drift and failures you did not anticipate. Offline gates the ship; online watches what shipped.

Do I need both offline and online LLM evaluation? Yes. Offline catches the regressions you introduce in your own code and prompts. Online catches changes that happen to you, like input distribution shift and silent model-provider updates that pass every offline case. Either one alone leaves a whole class of failures unmonitored.

How does online evaluation work without ground truth? You sample production requests and score them with automated evaluators, usually a judge model against a rubric, plus rule-based guardrails for toxicity or format. You also run A/B tests that compare live outcomes between variants. You confirm any suspected regression by reproducing the failing cases offline, where you do have reference answers.

Why did my offline evals pass but production quality dropped? Your frozen set reflects the traffic and model behavior from the day you built it, and production moved. New phrasings, new user segments, or a provider weight update can degrade real outcomes without triggering a single offline case. That gap is exactly what online evaluation exists to surface.

If you want the full harness this fits into, including frozen sets, judge calibration, and the offline-to-online loop, my book A Field Guide to Evals walks through it end to end. For the surrounding discipline, start with my guide to LLM evaluation, and read how to measure hallucination for the metric most online monitors should watch. If you would rather have a team stand up the offline gate and the online monitor in your stack from day one, that is exactly what Devlyn's AI observability and monitoring work is for. Gate it offline. Then watch it online.

Memory Systems for AI Agents: Remember Without Inventing

Alpesh Nakrani — Fri, 12 Jun 2026 18:30:00 GMT

Memory systems for agents are the design of what an AI agent retains across steps and across sessions. Agent memory has three layers: short-term memory, which is the context window the model reads right now; long-term memory, which is a retrieval-backed store the agent queries for relevant past facts; and episodic memory, which is the record of what happened in earlier runs. The mechanics are the easy part. The hard part is honesty. A memory system that misremembers is more dangerous than one that forgets, because it answers with confidence built on a fact that is no longer true.

So the rule I hold is narrow. Build the smallest memory that closes a measured gap, and instrument it so a stale or invented fact surfaces in an eval before it surfaces in front of a user. Most teams do the opposite. They bolt on persistent memory because it demos well, then discover months later that the agent has been confidently telling customers things that stopped being true in week two.

I write this from two years of putting retrieval-backed systems into production, and from the seat where a wrong answer to a customer is a refund, a churn, or a compliance call. Memory is where agents get genuinely useful and where they get quietly dangerous. Both happen in the same code path. Memory is one slice of the larger question of which agentic workflows actually earn their keep, and it deserves its own honest accounting.

A system that forgets asks you to repeat yourself. A system that misremembers tells your customer something false with full confidence. The second failure costs more.

Key takeaways

If you read nothing else, read these.

Memory is three layers, not one. Short-term is the context window, long-term is a retrieval-backed store, episodic is the log of past runs. They fail differently and you instrument each separately.
A longer context window is not memory. Models attend worst to the middle of long inputs, so stuffing history into the prompt degrades quietly as it grows.
The real failure is confabulation, not forgetting. Stale facts and summarization drift make the agent assert old truths confidently. Errors in evolving memory are cumulative and persistent.
Honest memory needs provenance, freshness, and contradiction checks. Every stored fact carries a source, a timestamp, and a path to be overruled by newer input.
Memory is RAG with a write path. The retrieval discipline that keeps RAG alive applies, plus the new problem of governing what you wrote down.

The three types of agent memory

Short-term memory is the context window. It holds the current conversation, the last few tool results, and the working state of the task in progress. It is fast, exact, and gone when the session ends or the window fills. This is the only memory many agents actually need, and the one teams most often over-engineer past.

Long-term memory is an external store the agent retrieves from. In 2026 the workhorse pattern is a vector or structured store indexed by user, session, and agent, separate from the model. During a run, a memory layer extracts facts and writes them. At the start of the next run, it retrieves the relevant ones by similarity, keyword, and entity match, then injects them into the context window before the model responds. Redis, mem0, and similar tools have standardized this long-term memory architecture around a read-then-write loop, and the shape is consistent across vendors.

Episodic memory is the record of past episodes: what the agent did, what tools it called, what the outcome was. It answers "have I done this before, and how did it go." Most production systems consolidate episodic detail into semantic long-term memory over time, distilling "on March 3 the user asked X and I did Y" down to "the user prefers Y." That consolidation step is exactly where honesty starts to leak, because summarization throws away the provenance that let you check the claim later.

Short-term remembers the conversation. Long-term remembers facts. Episodic remembers what happened. Confuse them and you build a system that knows everything and can verify nothing.

Why a longer context window is not memory

The tempting answer in 2026 is to skip the architecture and use the window. Context windows now run from 128k tokens to several million. Why not keep the whole history in the prompt and let the model sort it out?

Because the model does not read a long context evenly. The well-replicated "lost in the middle" finding shows performance is highest when relevant information sits at the start or end of the input and degrades sharply when the model must use information buried in the middle, even for models built for long context. A 2-million-token window is not 2 million tokens of reliable recall. It is a strong start, a strong end, and a soft middle that gets softer as you fill it.

This is a real architectural distinction, not a preference. Memory in a serious agent is a separate component you query and curate, not a longer prompt you append to. The context window is where memory gets used; it is not where memory lives.

Treating the window as the store also means cost and latency climb with every turn, because you re-send the entire history on each call. You pay more to get a recall curve that is sagging in the middle.

The practical split is the one most production agents land on. Working memory stays in the context window. Durable facts live in an external store. A retrieval step pulls the few records that matter into the window each step, ranked and placed deliberately, with the most important facts at the edges where the model actually reads them.

The failure mode that matters: staleness and confabulation

Here is the part the architecture diagrams skip. The dangerous failure of agent memory is not forgetting. It is remembering wrong with confidence.

Stale facts are the common case, and they need no adversary. A customer's plan changes, a function gets renamed, an address moves. The fact you stored in week two is now false, but it still retrieves on similarity and still reads as authoritative.

In coding agents this is acute: a developer refactors a module and the memory index keeps serving the old signature, so the agent writes code against a snapshot of reality that no longer exists. Without a time-to-live or a contradiction check, stale entries accumulate and pollute every retrieval that touches them.

Confabulation is worse because it compounds. When an agent repeatedly summarizes its own memory, it drifts: facts get smoothed, qualifiers get dropped, and an inferred detail hardens into a stored "fact." A recent survey on memory for autonomous LLM agents describes how errors in evolving memory are cumulative and persistent, unlike static RAG where a bad retrieval is isolated to one step. The agent can internalize its own hallucination as knowledge, then cite that knowledge later as if it were observed. It invents the user, then trusts the invention.

Stale memory retrieves a fact that used to be true. Confabulated memory retrieves a fact that was never true. The agent cannot tell the difference, and neither can your user.

There are quieter failures too, all from ordinary operation. Cross-user contamination, when a shared store leaks one user's facts into another's session; over-application, when a profile fact gets used in a context where it no longer holds; memory-induced sycophancy, when the agent leans on stored preferences to tell the user what it learned they want to hear. None of these need an attacker: they are the default behavior of a memory system nobody governed.

A practical design for honest memory

Honest memory is memory that knows what it knows and can be corrected. Four properties get you most of the way, and none of them require exotic infrastructure.

Provenance on every fact. Store the source and the run that produced it, so any claim can be traced back and checked. A fact with no source is a guess wearing a fact's clothes.
Freshness and TTL. Timestamp every record. Expire or re-verify volatile facts. A plan tier is volatile; a birthday is not. Treat them differently.
Contradiction detection on write. Before storing new input, check it against existing memory and flag conflicts. Using the model itself as a judge to compare a candidate fact against current context catches most stale-versus-new conflicts.
Confidence and an "I don't know" path. Let retrieval return low confidence, and let the agent say it is unsure instead of synthesizing. A memory that can abstain is worth more than one that always answers.

The instrumentation is where this becomes real. You want a memory trace per run that shows what was retrieved, how fresh it was, and whether anything contradicted current input. Here is the shape of a log I would actually watch.

# agent memory trace, one session, instrumented per retrieval

# config: ttl_days=30, contradiction_check=on, min_confidence=0.60

read key="user.plan_tier" value="premium" age=47d conf=0.55 flag=stale

read key="user.timezone" value="PST" age=12d conf=0.91 flag=ok

write key="user.plan_tier" value="standard" src="turn_3" contradiction="premium" action=supersede

# plan_tier was 47d old and below min_confidence, then user corrected it

# without the contradiction check, the agent answers on a stale "premium" fact

# cost of the miss: wrong eligibility quote --> refund --> support ticket

That trace is illustrative, not a client log, but the shape is real. The last line is the point. A stale plan_tier is not an abstract data-quality issue. It is a wrong eligibility quote, a refund, and a support ticket. Memory honesty is a revenue line, not a hygiene preference.

A founder I advised, Priya, shipped a support agent with persistent memory because it demoed beautifully in the pilot: it read each customer's stored profile and answered instantly. Six weeks in, roughly 1 in 12 conversations quoted a plan tier or feature that had changed since the fact was written, because nothing expired the old value. The fix was not more memory; we added a 30-day TTL on volatile fields and a contradiction check on write, which cut the wrong-fact rate to near zero within a week and turned the feature from a liability back into a win.

The contrast that taught me the rule was a second agent that did the opposite: it stored almost nothing and re-queried the system of record on every turn. It was a hair slower, but in three months it never once quoted a fact that had gone stale, because it had no stored facts to go stale. A live lookup against the source of truth beat a remembered copy on every dimension that mattered to the business. I have since killed memory features that had no freshness story, because a system that confidently quotes last quarter's pricing costs more than one that asks the user to confirm.

Memory is RAG with a write path

If long-term memory sounds like retrieval-augmented generation, that is because it is. The retrieve-rank-inject loop is the same one that powers RAG, and the failure modes rhyme. The big addition is the write path, which is also the new way to get hurt. RAG retrieves from a corpus you curated. Memory retrieves from a corpus the agent itself is writing, which means every confabulation can become tomorrow's retrieved "fact."

So the retrieval discipline carries straight over. Most RAG pipelines fail the same way in month three: the demo retrieves perfectly, the corpus grows, the queries drift, and recall collapses quietly while a capable model papers over the gap. Memory inherits all of that and adds drift in the stored facts themselves. When the agent decides what to retrieve and when, you are in agentic RAG territory, and the loop that re-searches until it finds something answerable is just as good at masking rotten memory as rotten retrieval.

The discipline is identical, instrumented for memory: a golden set of facts with known-correct values, freshness measured per fact, and contradiction-detection rates tracked over time. Without that, persistent memory is a more expensive way to be confidently wrong. The store does not save you from evals. It raises the stakes on them. If the knowledge and retrieval layer under your agent is the part that keeps drifting, that is exactly the work Devlyn does on RAG and knowledge integration.

When memory systems for agents are worth it, and when not

Reach for long-term memory when statelessness demonstrably costs you. The signal is concrete: users repeat context the agent should retain, or a task genuinely spans sessions and the cost of re-establishing state is real. That is a measured gap a store can close. Memory belongs to the broader question of which agentic workflows earn their keep, and the answer is the same: only where you can show the gain.

Do not add it when a cheaper fix is on the table. Often the agent does not need to remember across sessions at all. It needs a better prompt, a slightly larger working window, or a single authoritative lookup against your real database instead of a fuzzy memory of it. A live query against the system of record beats a remembered copy that can go stale, every time the source of truth is reachable. Memory is for what you cannot look up, not for caching what you can.

Good fit: durable user preferences, long-running projects, learned procedures the agent should reuse, anything genuinely cross-session.
Bad fit: facts that live in a database you can query directly, anything volatile enough that a cached copy is a liability, demos dressed up as a roadmap.

Frequently asked questions

What is AI agent memory?

AI agent memory is what an agent retains across steps and sessions. It spans short-term memory in the context window, long-term memory in a retrieval-backed external store, and episodic memory of past runs. The point is to give the agent durable state without making it a separate place that can drift out of sync with reality.

What is the difference between short-term and long-term agent memory?

Short-term agent memory is the context window the model reads right now: fast, exact, and gone when the session ends. Long-term agent memory is an external store the agent retrieves from across sessions. Short-term holds the live conversation and working state. Long-term holds durable facts you query back in when they are relevant.

Why do agents with memory hallucinate or get facts wrong?

Two reasons. Stored facts go stale when the world changes and nothing expires or re-verifies them, so old truths keep retrieving. And repeated self-summarization causes drift, where the agent smooths its own memory until an inference hardens into a stored fact. Provenance, freshness checks, and contradiction detection are what keep memory honest.

Is a large context window the same as agent memory?

No. A large context window is short-term working memory, and models attend worst to the middle of long inputs, so recall sags as you fill it. Durable memory is a separate, curated store you retrieve from deliberately. The window is where memory gets used, not where it should live.

If you are designing memory for an agent that has to hold up in front of real users, the design that survives is the one that is honest about what it knows and can be corrected when it is wrong. I go deeper on building agents you can actually trust in the field guide on honest AI agents, and on the retrieval layer memory is built on in my book RAG That Survives Contact With Production; for the broader patterns, see Agents That Actually Work. If you are building one of these for production and want a team that wires in provenance, freshness, and evals from day one, that is what my engineers do at Devlyn.

LLM Evaluation: Measuring What Will Break

Alpesh Nakrani — Thu, 11 Jun 2026 18:30:00 GMT

LLM evaluation is the discipline of measuring whether a language model's output is good enough to ship, using a fixed set of inputs, a defined rubric, and a number you trust before a customer finds the failure for you. It is not unit testing, which checks exact outputs. It is statistical: you sample real cases, score them against a standard, and gate the deploy on the result.

That definition is the easy part. The hard part is that most eval suites measure the wrong thing with great precision, pass every check, and then break in production anyway. I have watched it happen at company after company. The fix is not a better tool. It is better judgment about what to measure, applied by the people who own the model's behavior.

This is the thesis of everything I write: AI does the work, and the human evaluates. When generation gets cheap, evaluation becomes the scarce, defensible skill. So this guide is tool-neutral on purpose. The vendors writing eval content are selling an eval platform, and their advice bends toward their feature list. I am going to give you the harness instead, the one that gates a deploy, with the trade-offs named and the revenue consequence attached.

When generation is cheap, value migrates to whoever can tell good output from bad. Evaluation is that skill.

Key takeaways

LLM evaluation measures whether a probabilistic model's output is good enough to ship, using a frozen set of real inputs, a written rubric, and a scorer you trust.
The common failure is sampling, not metrics: a suite that passes at 95% can sit on a production reality near 70% because the test set never looked like real traffic.
Reference-overlap scores like BLEU and ROUGE lie; task-anchored metrics (correctness, relevance, faithfulness, safety) and model-versus-human disagreement on a frozen set tell the truth.
LLM-as-a-judge is reliable enough for scale only after you calibrate it against human labels; uncalibrated, it measures its own preferences, with documented position, verbosity, and self-preference bias.
Senior engineers who own a model's behavior own its eval suite. Evaluation is the work, not plumbing a platform team does on the side.

What LLM evaluation actually is

LLM evaluation is how you turn a probabilistic system into one you can make a decision about. A model gives a different answer to the same question on different runs. Temperature, context order, and a model version bump all shift the output. You cannot assert a single correct string and call it a test. So you measure distributions: across a representative set of inputs, how often is the output good enough, by a standard you wrote down in advance?

Three pieces make up any honest eval. First, a dataset of inputs that reflects what users actually send, not what you imagined they would. Second, a rubric or metric that defines "good enough" for your task. Third, a scorer, which can be exact-match logic, a human rater, or another model acting as a judge. Change any one of the three and the number changes. That is the whole game, and it is why a single accuracy figure rarely tells you what you think it does.

The reason this matters now is economic. A demo proves a model can do the task once. Evaluation proves it does the task reliably enough that you can put your name on it. I argue in the judgment economy that this gap is where the margin lives. Anyone can generate. Few can reliably tell the good generation from the plausible-but-wrong one at scale, and that capability is what you are actually building when you build an eval suite.

Why eval suites pass and then fail in production

The most common eval failure is not a bad metric. It is a sampling problem dressed up as a measurement problem. Your suite passes at 95% and production runs at 70%, and the gap is not the model getting worse. It is that your test set never looked like real traffic.

Most suites are built bottom-up. A developer writes cases while building the feature. A PM adds a few edge cases during review. The set accumulates into a distribution that reflects the team's imagination, not the user base. Those two distributions diverge, and they diverge most on exactly the inputs that cause incidents: the frustrated user phrasing things oddly, the code-switching query, the malformed paste. The offline-to-production gap is often far larger than teams expect, and it is structural, not bad luck (Label Studio documents the same 95-to-70 pattern across deployments).

The fix is mechanical but demands discipline: sample your eval set from real production traffic, freeze it, version it, and stop letting it grow organically. I cover the full harness in evals that predict production, the cornerstone essay this guide expands on. The short version: a moving eval set is not a ruler. It is a rubber band, and the number it reports is a fact about the test, not the model.

Your suite passed because it was easy. Production failed because it was real. That gap is a sampling decision, not a model defect.

Freezing the set has an uncomfortable implication that teams resist. Your score on a frozen set can only go down, because you cannot sneak in new easy cases to prop the number back up. That is the point. You want a fixed ruler. The reward is something most AI teams cannot honestly claim: a straight historical comparison. Is the model you are about to deploy better or worse than the one you shipped six months ago, on exactly the same questions, scored by exactly the same rubric? Freeze the set and you can answer that. Let it drift and you cannot.

The metrics that matter and the ones that lie

Most published eval metrics are reference-based scores that correlate poorly with whether a real user was helped. BLEU, ROUGE, and exact-match compare output to a gold string. They were built for translation and summarization, and they punish a correct answer phrased differently from the reference. A model can score badly on ROUGE and be right, or score well and be uselessly verbose. These metrics lie by being precise about the wrong thing.

The metrics that matter are task-anchored. For most generation tasks, you want some version of: correctness (is the claim true), relevance (does it answer the actual question), faithfulness (is it grounded in the source you gave it, not invented), and safety (does it refuse what it should). Each is scored against your rubric, not a gold string. None reduces to a single clean number across tasks, and any vendor promising one composite "quality score" is selling you comfort.

Here is the discipline that separates a real metric from a vanity one. A real metric moves when the model changes and holds still when the model holds still. If your number jumps because you reworded the rubric or swapped the rater pool, you measured the test, not the model. I take each metric apart, with the failure mode it hides, in the companion piece on the LLM evaluation metrics that matter, and I dig into which ones survive contact with production in A Field Guide to Evals.

Metric type	What it claims	What it actually measures	Trust it for
Exact-match / BLEU / ROUGE	Output quality	String overlap with one reference	Narrow extraction, classification
Aggregate accuracy	How good the model is	Sampling + rubric + raters + model, mixed	Coarse trend only
Faithfulness / groundedness	No hallucination	Claims supported by provided context	RAG, summarization
Human-vs-model disagreement	Production readiness	Divergence from calibrated humans on a frozen set	Go / no-go gating

The trap is aggregate accuracy. It is affected by your sampling strategy, your rubric, your rater pool, and the model, four variables at once. When the number moves you often cannot say which variable moved it. That makes it a poor basis for a ship decision and a great basis for fooling yourself.

Building an LLM evaluation framework

An evaluation framework is a process, not a product you install. The build-your-own discipline has five parts, and the order matters.

One, define the failure that costs money. Before you write a single test, name what breaking in production actually costs. A wrong SKU on an order. A hallucinated policy in a support reply. A refused query that should have converted. The metric you build should track that failure, not a generic benchmark score. This is the step teams skip, and it is why their dashboards are green while revenue leaks.

Two, sample the dataset from production. Pull real traffic, stratify it by intent, and over-weight the hard slices. Freeze it as a named artifact. The set is the foundation; a perfect rubric on a fake dataset measures nothing.

Three, write the rubric in plain language. A rubric a senior engineer can read and apply in under a minute beats a clever scoring function nobody understands. Ambiguity in the rubric shows up later as rater disagreement, which is a signal, not noise.

Four, choose the scorer per metric. Use exact logic where the answer is deterministic. Use a model judge where it is calibrated and cheap. Use calibrated humans where the margin is hard and the cost of a miss is high. Most teams reach for one scorer for everything; the right answer is a mix.

Five, gate the deploy on a threshold set in advance. Setting the bar after you see the result is not evaluation. It is rationalization. The product and engineering teams agree on the threshold before the run, and the run either clears the gate or it does not. That gate is the offline half of the picture; what you measure after release is the online half, and I draw the line between them in offline vs online evaluation.

The framework that gates a real deploy looks more like a controlled experiment than a CI badge. I walk through the harness end-to-end, including stratified sampling and versioning, in the cornerstone eval essay, and the step-by-step build in how to build an LLM evaluation framework. The point of the framework is not coverage. It is a defensible decision.

If you would rather have this built into a system from day one than retrofit it after the first incident, that is the work a Devlyn AI engineering pod does, with the eval harness gating the deploy. The judgment about what to measure still stays with your team. A pod just builds the rails that let your engineers hold it.

LLM-as-a-judge: when to trust the model grading the model

LLM-as-a-judge means using a strong model to score another model's output against a rubric. Used well, it agrees with human reviewers around 85% of the time on many tasks, which is higher than two humans typically agree with each other (Confident AI reports this range across applications). That makes it the only practical way to score at the volume production demands. It is also where most teams quietly lose the plot.

Trust the judge when three conditions hold. The rubric is concrete and the task has a clear notion of correct. You have calibrated the judge against human labels on a sample and measured agreement, ideally a Krippendorff's alpha near 0.8. And you control for the known biases. Without that calibration step, you are not measuring quality. You are measuring the judge's preferences.

The biases are well documented and they are not subtle. Judges show position bias, favoring the first option in a pairwise comparison. They show verbosity bias, rewarding longer answers regardless of correctness. They show self-preference, scoring outputs from their own model family higher. A 2026 RAND study found no judge is uniformly reliable across benchmarks, with frontier models exceeding 50% error on hard bias tests and consistency breaking on changes as small as reformatting or paraphrasing (Adaline's summary; the underlying scoring-bias work is on arXiv).

An LLM judge does not measure quality until you calibrate it against humans. Before that, it measures its own preferences.

The honest trade-off: a model judge buys you scale and costs you a ceiling. Calibrated humans catch the confident, plausible, wrong answer that a judge waves through, because a domain expert knows the answer is wrong and the judge only knows it sounds right. My rule is to judge with a model where volume demands it, audit a sample with humans on a fixed cadence, and never let the judge grade the hardest 10% of cases unsupervised. That is also a revenue rule: the cases a judge mis-grades are disproportionately the ones that cost you a customer. I go deeper on the calibration protocol and the bias controls in when to trust LLM-as-a-judge, and on how to spend that scarce human attention on the flagged tail in human-in-the-loop evaluation that scales.

Building a golden eval set from production traffic

A golden set is a frozen, versioned collection of real inputs with trusted reference answers, used as the fixed ruler for every model candidate. Build it from production traffic, not from your imagination, and build it to over-represent the cases that hurt.

The sampling that matters is stratified and adversarial. A uniform random sample under-represents every hard case, because hard cases cluster rather than spread evenly. I deliberately over-sample four buckets: cases where the model's confidence was in the bottom quartile, cases where a human reviewer submitted a correction, syntactically adversarial inputs like code-switching or truncated text, and cases that previously caused a production incident even after the root cause was fixed. That last bucket is the one teams skip because the incident feels resolved. It is not resolved until a future model version passes those exact cases on a held-out set.

Reference answers come from calibrated humans under a blinded protocol, so the rater never knows which model version produced an output. Blinding is not academic. I have watched a good engineer give quiet benefit-of-the-doubt to outputs from a model they helped tune. Strip the version, shuffle the order, and plant a known proportion of gold human responses in the batch unlabeled. If raters start scoring the planted human answers below your model's passing bar, the rubric has drifted and you stop scoring until you recalibrate.

Over-sampling the tail is a real trade-off: your aggregate number will look worse than a uniform sample would show. Good. A metric that reflects your hardest real traffic is more honest than one that flatters your average traffic. Ship the model that passes the hard set, not the one with the prettiest headline number. I lay out the full sampling and blinding protocol in how to build a golden eval set from production traffic.

Offline versus online evaluation

Offline evaluation tests a candidate before deploy against a fixed golden set with known good answers. Online evaluation scores live production traffic as it arrives, watching for quality drops, hallucinations, and policy violations without a reference answer. You need both, and they catch different failures.

Offline catches the failure modes you already know about, and it gates the deploy. Online catches the novel ones and the distribution shift your frozen set could not anticipate. The gap between them is the whole reason this is hard: a curated set at 95% can sit on top of a production reality at 70%, because real users do things your set never sampled (LangChain's eval guide frames the same offline-as-regression, online-as-monitoring split).

	Offline eval	Online eval
When	Pre-deploy, in CI	Live, on production traffic
Reference	Known good answers	Usually none
Catches	Known failure modes, regressions	Novel failures, drift
Decision	Ship / do not ship	Alert / roll back / sample for review

The practical discipline: use the same rubric and scorer for both, so an offline pass and an online pass mean the same thing. When they diverge, the divergence itself is your most valuable signal, because it tells you exactly which cases your golden set is missing. Feed those back into the next version of the set. The handoff between the two modes, offline versus online, is where most teams lose the thread. Online evaluation is also where cost and latency live as first-class metrics, since a model that is correct but slow or expensive can still be a product failure. That is the bridge from evals to production observability and monitoring, where the eval rubric becomes a live guardrail.

Evaluating RAG and agents

RAG and agents fail in places a single-turn eval cannot see, so they need system-specific metrics. Evaluating the final answer alone hides where the system actually broke.

For RAG, you evaluate retrieval and generation separately, because a wrong answer can come from either stage. The standard metrics, popularized by frameworks like RAGAS, are context recall (did retrieval surface the documents needed to answer), context precision (how much of what it surfaced was relevant), faithfulness (is the answer grounded in retrieved context rather than invented), and answer relevancy (does it address the actual query). The frameworks now ship a dozen-plus retrieval and generation metrics, separating the two stages so you can fix the right one (RAGAS docs). A faithfulness drop points at generation. A recall drop points at retrieval, chunking, or embeddings.

The failure that the metrics warn you about is the slow one. Retrieval looks perfect in the demo, then the corpus grows, queries drift, and recall quietly collapses over weeks. I tell that whole story, with the decay curve, in why RAG pipelines fail in month three, and I work the retrieval metrics end-to-end in RAG evaluation: measuring retrieval before it collapses. The eval lesson: track context recall on a frozen query set over time, not just at launch, or you will not see the collapse until a customer does.

For agents, the final answer is the least informative thing to score. An agent chains planning, tool calls, retrieval, and sub-agent handoffs across a long trajectory, and it can reach a right answer through a broken path or a wrong answer through a sound one. So you evaluate the trajectory: task completion (did it achieve the user's goal), tool-call accuracy (did it call the right tool with the right arguments), plan quality (did it decompose the task sensibly and know when it had enough to act), and step-level traces for loops and dead ends. The 2026 consensus has moved from single-axis completion scores to multi-dimensional, trajectory-aware evaluation precisely because completion alone hid the failures (Confident AI's agent guide lays out the metric set). I cover the honest limits of agent reliability in an honest look at agents; the eval implication is that an agent good at four tasks and a liability on the fifth needs per-task gates, not one global score. The full trajectory-scoring method has its own walkthrough in how to evaluate an AI agent.

The one metric worth reporting to the business

The single metric I report up is model-versus-trusted-human disagreement on a frozen, production-sampled set, tracked over time. Not aggregate accuracy. One number, anchored to a fixed ruler and to human performance, that tells the truth about whether the model got better.

Here is why it beats accuracy. It is anchored to a fixed distribution, so a change in the number reflects a change in the model, not the test. It is anchored to human performance, so it has a meaningful floor and ceiling. And it is directional: up means worse, down means better, and you can open the exact cases that moved. Leadership wants a number, which is legitimate. This is the number that does not lie to them.

The revenue translation is the part that earns the meeting. Disagreement on the frozen set maps to the failure you priced in step one of the framework: a point of disagreement on the support-resolution cluster is some number of mishandled tickets, some churn, some support cost. When you can say "this candidate cuts disagreement on the revenue-critical cluster from 8% to 6%, which is worth roughly X in retained accounts," you have turned an engineering metric into a business decision. That sentence is why evaluation belongs in the room where budgets get set, and it is the whole argument of the judgment economy made concrete.

Aggregate accuracy is for dashboards. Model-versus-human disagreement on a frozen set, priced in dollars, is for decisions.

Running the harness: what an eval run looks like

Here is an abbreviated run against a frozen set. The numbers are realistic but illustrative, not from a specific live system.

# offline eval against frozen golden set

python -m eval.runner \

--suite golden-2026-w24-v2.jsonl \

--model prod-candidate-2026-06-15 \

--judge calibrated-judge-v3 --human-audit 0.10

# results summary

cases evaluated 912

faithfulness 0.948 # RAG cases only, n=410

context recall 0.882 # down 0.021 vs prior, FLAG

tool-call accuracy 0.913 # agent cases, n=190

human disagree 5.8% # threshold 8.0%, PASS

adversarial tail 13.4% # threshold 18.0%, PASS

judge-vs-human agree 0.79 alpha # floor 0.75, PASS

p95 latency 2,110 ms # +180 ms, review

verdict GATE HOLD # recall regression blocks deploy

Read what that run is telling you. Aggregate quality looks fine, but context recall dropped 0.021 and the gate holds on that alone, because a recall regression in RAG is the slow-collapse failure starting early. The judge-vs-human alpha at 0.79 confirms the judge is still calibrated enough to trust this run; if it had fallen below the floor, every other number would be suspect and the run would void. The latency flag does not block by itself, but it forces a human review before any override. This is the difference between a harness and a dashboard: the harness makes a decision and shows its work.

Who owns evals: senior engineers, not a tooling team

The engineers who own a model's behavior in production own the eval suite for that behavior. They write the rubric, sit in on rater calibration, read the disagreement reports, and decide when the rubric needs revision. This is not infrastructure work that a platform team does on the side. It is the work.

The failure mode I see most is treating evals as plumbing: a platform team owns the runner, it produces a number in CI, and product engineers passively consume it. That arrangement reliably measures the wrong things with great precision, because the people who understand what "good" means for the task are not the people defining how it is scored. Shipping a model without understanding the eval suite that gated it is the same as shipping code without understanding the tests.

This stance only scales when the eval infrastructure is legible enough that a senior engineer can trace any metric back to the sampling and rubric choices that produced it. Legibility is an engineering requirement, not a nice-to-have. The deeper argument is that Human in the Loop Is Not a Plan: you cannot outsource production judgment to a review queue and call it a quality system, and you cannot outsource it to a tooling team either. I make the operational case for keeping humans on the judgment, not the volume, in the human-loop essay. The model meets the bar on your frozen, adversarially-sampled, human-calibrated set, or it does not ship.

The eval-driven posture takes this further: let the test suite lead the model, write the eval before the feature, and treat a failing eval as a spec for what to build next. That is the subject of Eval-Driven Development, and it is the cleanest way I know to keep judgment at the center of the loop as autonomy grows.

Tools: a neutral read

The tooling landscape splits into a few honest categories, and the right choice depends on what you are gating, not on which vendor has the best demo. I am naming categories, not endorsements.

Open-source metric libraries give you reference metrics and RAG scorers you run yourself. Best when you want full control of the rubric and no vendor in your data path. RAGAS and DeepEval sit here.
Tracing and observability platforms capture production traces and run online evals on live traffic. Best when your hardest problem is seeing what the system actually did. Arize, LangSmith, and Braintrust sit here.
Managed eval platforms bundle dataset management, judges, and human-annotation workflows. Best when you want one workflow and will accept the platform's opinions. Most vendor blogs are written from here.

The trade-off no tool removes is ownership. A platform makes running evals easier; it does not make your rubric correct or your sampling honest. The judgment about what to measure stays with your engineers no matter what you buy. Pick the lightest tool that lets a senior engineer trace a number to its source, and spend the saved effort on the dataset and the rubric, which are the parts that actually determine whether your eval predicts production. I compare the categories in more depth, still vendor-neutral, in LLM evaluation tools compared.

Frequently asked questions

What is LLM evaluation? LLM evaluation is the practice of measuring whether a language model's output is good enough to ship, using a fixed set of representative inputs, a written rubric, and a scorer. Because models are probabilistic, you measure distributions of quality rather than asserting one correct output, and you gate the deploy on the result.

How do you evaluate an LLM? Sample real inputs from production, freeze them into a versioned golden set, define a rubric for what good means on your task, score outputs with exact logic, a calibrated model judge, or human raters, and gate the deploy on a threshold agreed in advance. Evaluate retrieval and reasoning trajectories separately for RAG and agents.

What metrics should I use for LLM evaluation? Use task-anchored metrics like correctness, relevance, faithfulness, and safety rather than reference-overlap scores like BLEU or ROUGE, which punish correct answers phrased differently. The one metric worth reporting up is model-versus-trusted-human disagreement on a frozen production set, tracked over time.

Is LLM-as-a-judge reliable? It is reliable enough for scale once calibrated against human labels, agreeing with humans around 85% of the time on many tasks. Without calibration it is unreliable, because judges show position bias, verbosity bias, and self-preference, and 2026 research found frontier judges exceeding 50% error on hard bias tests.

How do I build a golden eval set? Sample real production traffic, stratify by intent, and over-sample hard cases: low-confidence outputs, human corrections, adversarial inputs, and past incidents. Have calibrated humans write reference answers under a blinded protocol, then freeze and version the set so your ruler never moves.

What is the difference between offline and online evaluation? Offline evaluation tests a candidate before deploy against a fixed set with known answers and gates the ship decision. Online evaluation scores live traffic without references to catch novel failures and drift. You need both, and the gap between them tells you which cases your golden set is missing.

Where to take this next

If you are building the harness yourself, start with the cornerstone on evals that predict production and the deeper reference in A Field Guide to Evals. Both go past the overview here into the sampling, blinding, and rubric protocols that decide whether your evals are worth trusting.

If you are shipping a real system and want evaluation and observability built in from day one rather than bolted on after the first incident, that is exactly the work a Devlyn AI engineering pod does, with the eval harness gating the deploy and production monitoring carrying the same rubric into live traffic. The point of all of this is one thing: the machine does the work, and you keep the judgment. Build the suite that predicts production, freeze it, and trust the number it gives you over the number you wished it gave.

Human-in-the-Loop Evaluation That Scales

Alpesh Nakrani — Wed, 10 Jun 2026 18:30:00 GMT

Human-in-the-loop evaluation scales only when humans review the right slice: the low-confidence, high-stakes, adversarial tail that an automated layer flags first. Reviewing every output is not the safe choice it looks like. It is a bottleneck that becomes a rubber stamp, then a liability. The scarce resource is human judgment, and you spend it where the machine is least sure and the cost of being wrong is highest.

I argue the negative case at length in Human in the Loop Is Not a Plan: an unspecified reviewer collapses under load. This piece is the positive case. If review-everything is the failure, what does review-the-tail actually look like as a system you can staff, measure, and defend? The answer turns on three things most teams skip: how you blind the rating, how you measure agreement, and how you keep the reviewers themselves calibrated. It is one branch of my complete guide to LLM evaluation; this is the human branch.

Key takeaways

If you read nothing else, these are the load-bearing claims:

Review the flagged tail, not every output. An automated layer ships the high-confidence, low-stakes majority and routes only the low-confidence, high-stakes, and adversarial slice to a person.
Blind the rating or you measure the wrong thing. Strip model names, randomize order, and withhold the running pass rate, or reviewers grade the story they already believe.
Measure your humans before you trust them. Compute a chance-corrected agreement statistic (Cohen's kappa or Krippendorff's alpha); below the bar, the rubric is the problem, not the people.
Calibrate the reviewers, not only the model. Seed gold cases, re-blind monthly, and cap load, because automation bias turns a tired reviewer into a rubber stamp.
The review tier is a cost line. Routing 5% of volume to humans instead of 100% is the difference between four reviewers and eighty.

Spend human judgment where the machine is least sure and the cost of being wrong is highest. Everywhere else, let the automated layer carry the load.

The triage architecture: route by confidence and stakes

Human-in-the-loop AI works when the loop is a router, not a wall. Every output passes through an automated layer first. That layer assigns two things to each item: a confidence signal and a stakes label. The pair decides where the item goes.

High confidence, low stakes: ship automatically. This is most of your traffic, and a human adds cost without changing the outcome.
Low confidence, any stakes: route to a human. The model told you it was unsure; that is exactly the signal worth a person's time.
High stakes, any confidence: route to a human regardless of score. A confident wrong answer on a prescription or a contract is the most expensive failure there is.
Adversarial or out-of-distribution: route to a human, and add the case to the eval set. Novelty is where confidence scores lie, and where hallucinations hide.

Confidence here is not the model's raw token probability, which is poorly calibrated. It is a derived signal: agreement between an LLM judge run twice, the margin in a pairwise comparison, a retrieval-grounding check, or a small ensemble that disagrees. The point is that the automated layer pre-sorts the pile so the human review of LLM output lands only on cases that move the needle.

Priya, a staff engineer on a clinical-summary tool, ran the arithmetic before she staffed anything. Her judge flagged about 6% of outputs as low-confidence or high-stakes. Routing that 6% to two clinicians caught 91% of the errors her old random-sample QA had been missing, at a fraction of the reviewer hours. The other 94% shipped under the automated gate, and the error rate on that slice never moved.

This is the difference between human-in-the-loop and human-on-the-loop. In-the-loop means a person gates specific items before they ship. On-the-loop means a person monitors the aggregate and intervenes when a metric drifts, which is the same split I draw in offline versus online evaluation. A mature system uses both: in-the-loop for the flagged tail, on-the-loop for everything that cleared automatically. You watch the river and you inspect the rocks.

Blinded rating, or you are measuring the wrong thing

When a human does review, how you present the work decides what you learn. Unblinded rating contaminates the result. If a reviewer can see that output A came from the new model and output B from the old one, they rate the story they already believe, not the text in front of them.

Blinding for human feedback evaluation means three concrete moves. Strip the source: no model names, no version tags, no "this is the candidate we hope wins." Randomize order on every pairwise comparison, because position bias is real for people too, not only for an LLM judge. And withhold the aggregate: a reviewer who knows the model is "passing at 94 percent" will unconsciously round up the marginal case to keep the streak alive.

# What the reviewer sees - source stripped, order randomized

item = {"prompt": ..., "a": resp_X, "b": resp_Y} # X/Y hidden

show_order = shuffle(["a", "b"]) # kill position bias

# Reviewer never sees: model name, version, running pass rate

Blinding costs almost nothing to build and it changes the numbers. On one launch, Marcus, the eng lead, was sure the new model was "clearly better" and had the unblinded thumbs-up to prove it: 78% of reviewers preferred it. We re-ran the same comparison blinded, and the preference fell to 51%, a coin flip. The launch slipped two weeks, and that two weeks was worth more than the launch. An unblinded thumbs-up is a vibe. A blinded preference, collected under a rubric, is evidence.

Inter-rater agreement: measure your humans before you trust them

A single reviewer's verdict feels authoritative and tells you nothing about whether it is repeatable. Before human ratings can gate anything, you measure how much your reviewers agree with each other. Hand the same sample to two or three of them, blinded, and compute inter-rater agreement.

Use a chance-corrected statistic, not raw percent agreement. Raw agreement looks high whenever one label dominates, which it usually does. Cohen's kappa corrects for chance on two raters; Krippendorff's alpha generalizes to any number of raters, handles missing labels, and works across nominal and ordinal scales, which is why it is the safer default for a real review panel. The common reading of kappa, from the Landis and Koch convention, treats 0.61 to 0.80 as substantial and above 0.81 as near-perfect. For alpha, 0.80 is the usual bar for trusting a label, with 0.667 a floor for tentative conclusions.

Here is the part that stings. When experienced reviewers disagree on a third of the cases, the problem is almost never the people. It is the rubric. Two experts grading "is this answer good" against private intuition will produce two different measurements of two different things. Disagreement is not noise to average away; it is a map of exactly where your rubric is ambiguous. Read the disputed cases, sharpen the criteria, and re-measure. This is the same loop that makes a golden eval set trustworthy: the test is only as honest as the agreement behind it.

When two experts disagree on a third of the cases, fix the rubric, not the people. Disagreement is a map of where your criteria are ambiguous.

Calibrate the reviewers, not only the model

Everyone talks about calibrating the model. Almost nobody calibrates the humans. Reviewers drift. They get tired, they get fast, and they slide toward the model's answer because it has been right for a long time. That last drift, automation bias, is the quiet killer: the reviewer becomes a confirmation step, not an evaluation.

The fix is to treat reviewers as instruments that need recalibration on a schedule. Three practices hold the line:

Seeded gold cases. Salt the queue with items that have a known correct verdict. If a reviewer misses the seeded failures, their recent ratings are suspect and the rubric or the training needs another pass.
Periodic re-blinding against each other. Re-run the inter-rater check monthly, not once at onboarding. Agreement decays as the easy cases get automated away and only the hard tail reaches the human.
Rotation and load caps. A reviewer with 300 items in the queue is not evaluating. Cap the daily load and rotate the panel so fatigue does not masquerade as consensus.

The same instability shows up in the automated layer, which is why you cannot lean on it blindly either. A 2025 study on LLM judges, Rating Roulette, documents that a model grading the same item twice will often return different verdicts. Self-inconsistency in the judge is one more reason the tail needs a calibrated human, and one more reason that human needs checking too. Top judges land near 80 percent agreement with people on well-scoped tasks, roughly where two trained humans land with each other, a result first measured on MT-Bench. That is good enough to triage and not good enough to be the last signature.

The org and cost angle: who owns the loop

Human-in-the-loop evaluation is an org design problem wearing an engineering costume. The review tier is a real line item, and most teams discover it only after the queue collapses. Run the arithmetic before you scale, not after.

Say a senior reviewer's loaded cost is roughly $90 an hour and a careful review takes 4 minutes. Reviewing 10,000 outputs a day at 100 percent means about 667 reviewer-hours a day, which is north of 80 full-time people. Route only the flagged 5 percent to humans and the same volume costs about 33 reviewer-hours, four or five people, with the automated layer carrying the rest. The triage architecture is not a quality nicety. It is the difference between a unit economic that works and one that does not.

That is also where the two seats see different things, and why the decision needs both. From the engineering seat, the routing threshold is a tunable parameter. From the revenue seat, that same threshold is a bet on margin and liability at once: loosen it and you ship faster but more bad outputs reach a customer; tighten it and quality holds but the review bill climbs and latency grows. Whoever owns the loop has to read the trace and the P&L in the same glance. Evaluation is the scarce, defensible skill here, and the org that prices it correctly compounds an advantage the one that hand-waves "a human checks it" never will.

My rule for ownership: the senior engineer who ships an AI feature owns its review design as a first-class artifact, the same way they own its tests. They define the confidence signal, set the routing thresholds, write the rubric, and run the inter-rater check. The reviewers are part of the system they built, not a team they handed the problem to. Autonomy expands as the eval coverage and the agreement numbers earn it, and contracts the moment the metrics say so.

The honest trade-off

Routing by confidence and stakes only works if the routing is right, and the routing is never permanently right. Set the threshold too loose and confident-wrong outputs ship under a green light. Set it too tight and everything routes to a human, which rebuilds the bottleneck you spent all this effort to dismantle. The threshold drifts as traffic shifts, as the model updates, as adversaries learn your gaps. There is no set-and-forget version. You are signing up to monitor and re-tune the loop forever, which is the human-on-the-loop half of the job and squarely an AI observability and monitoring problem. The alternative, a static rule and a tired reviewer, costs more. It just hides the bill until a customer finds it for you.

Frequently asked questions

What is human-in-the-loop evaluation?

Human-in-the-loop evaluation is a design where people review a routed subset of AI outputs rather than all of them. An automated layer scores every output for confidence and stakes, ships the safe majority, and sends the low-confidence, high-stakes, and adversarial tail to a human whose verdict is authoritative. It scales because human judgment is spent only where it changes the outcome.

How do you measure agreement between human reviewers?

Give the same blinded sample to two or more reviewers and compute a chance-corrected statistic. Cohen's kappa works for two raters; Krippendorff's alpha generalizes to many raters and mixed scales. Treat alpha at or above 0.80 as trustworthy and read every disputed case, because disagreement usually means the rubric is ambiguous, not that a reviewer is wrong.

Is human review better than an LLM judge?

Neither alone is enough. An LLM judge is cheap and fast but self-inconsistent and biased, landing near 80 percent agreement with humans on well-scoped tasks. Use the judge to triage the volume and humans to gate the flagged tail. The judge sorts the pile; a calibrated, blinded human decides the cases that change a release, a contract, or a customer.

How many outputs should humans review?

As few as your error tolerance allows, chosen by stakes and confidence rather than a fixed percentage. Review 100 percent of high-stakes outputs, a calibrated sample of mid-stakes ones, and only the exceptions for low-stakes traffic. The right number is whatever keeps undetected error below your threshold without turning reviewers into rubber stamps.

If you are designing the review tier rather than just naming a reviewer, that is the work I keep returning to in Human in the Loop Is Not a Plan and the broader harness in A Field Guide to Evals. And if you want a team to instrument the routing thresholds and drift monitoring in your stack, that is what Devlyn's AI observability and monitoring work is for. Design the loop, measure the humans, and price the tier before a customer prices it for you.

The Best AI Agents in 2026 (An Honest Roundup)

Alpesh Nakrani — Tue, 09 Jun 2026 18:30:00 GMT

The best AI agents in 2026 fall into four categories, and each is genuinely good at a different job. Coding agents (Claude Code, Codex CLI, Cursor) are the most mature, handling multi-file changes inside real repositories. Deep-research and browser agents (Perplexity, OpenAI Deep Research, Operator-style computer use) gather and synthesize across the web. Customer-ops agents (Intercom Fin, Ada and similar) resolve a majority of support tickets before a human sees them. Orchestration frameworks (LangGraph, OpenAI Agents SDK, CrewAI) are how you wire the rest together. None of them removes the human. They move the human to the end of the loop.

I have spent two years putting agents into production from the seat where engineering meets revenue. So this is not a hype list. It is an honest accounting of what each category is good at, where it quietly fails, and where you still need a person to evaluate the output. It sits under my broader field guide to where agentic workflows actually earn their keep, and the thesis I keep coming back to holds here too: the machine does the work, and the human evaluates.

Below I name the strongest agents per category, the honest limit on each, and a comparison table you can act on. I do not rank a single winner, because the best AI agent is the one that fits the task you actually have.

The best AI agent is not the smartest one. It is the one whose failure mode you can catch before a customer does.

Key takeaways

If you read nothing else, read these.

Coding agents are the most mature category. Claude Opus 4.8 reports 88.6% on SWE-bench Verified; the agent harness around the model matters as much as the model itself.
Research agents are fast but not yet trustworthy. They draft a literature review in minutes and still cite sources that do not say what the summary claims. You verify before you ship.
Customer-ops agents resolve roughly 55-70% of tickets autonomously. The remaining 22% or so escalate to a human, and the warm handoff is the part that earns the ROI.
Browser and computer-use agents still trail humans badly. Top scores sit near 72% on OSWorld against a human baseline; treat them as assistive, not autonomous.
Frameworks are not the product. LangGraph, the OpenAI Agents SDK, and CrewAI trade control for setup speed. Pick by what they cost you in observability and lock-in.

The best AI agents in 2026, by category

Here is the honest comparison. Each row names a category, the strongest current examples, what it is genuinely good at, and the limit I would not pretend away.

Agent / category	Best for	Honest limit
Coding agents Claude Code, Codex CLI, Cursor	Multi-file refactors, bug fixes, and scoped features inside a real repository	Loses the plot on large, ambiguous changes; the same model scores differently per harness, so results are not portable
Research / deep-research agents Perplexity, OpenAI Deep Research	Multi-step web research, source gathering, and first-draft synthesis in minutes	Hallucinates citations and overstates what sources say; needs a human to verify every load-bearing claim
Browser / computer-use agents Operator-style, Claude computer use, Comet	Clicking through known web flows: forms, lookups, repetitive UI tasks	Real-world accuracy near 72% on OSWorld vs roughly 72% human baseline still leaves it fragile on novel screens
Customer-ops agents Intercom Fin, Ada and peers	Tier-1 support: resolving common tickets and routing the rest with context	Resolves 55-70% autonomously; the other third escalates, and a bad handoff erases the savings
Orchestration frameworks LangGraph, OpenAI Agents SDK, CrewAI	Wiring agents, tools, and handoffs into an auditable production system	Not an agent themselves; more autonomy means more failure surface and more to observe

Coding agents are the most mature category

If you want one category that genuinely earns its keep today, it is coding agents. They operate in a near-ideal environment for autonomy: the task is bounded, the output is testable, and a failed change is reversible with a git revert. That combination is exactly the narrow band where agents work.

The benchmarks back this. Claude Opus 4.8 reports 88.6% on SWE-bench Verified, among the highest published figures, and Claude Code is strong on multi-file refactors in large codebases. On Terminal-Bench, Codex CLI on GPT-5.5 tops the generally available models at 83.4%, with Claude Code close behind. Cursor wins on flow: fast autocomplete and in-editor chat for small, scoped tasks.

Here is the honest limit, and it is load-bearing. The same model scores differently in different agent harnesses. In one February 2026 test, three frameworks running the same underlying model finished 17 issues apart on 731 problems. The wrapper matters as much as the model, which means a benchmark number does not transfer to your repo. You still read the diff. I wrote more on this in my guide to where agentic workflows actually earn their keep, and went line by line in the walkthrough on agentic coding.

A small team I advised swapped manual bug-fixing for a coding agent on a 400,000-line Rails codebase. On tickets with a failing test attached, the agent cleared 31 of 40 on the first pass over two weeks. On vague tickets with no test, it cleared 4 of 17 and quietly introduced two regressions. Same model, same harness. The only variable was whether the work came with an oracle. That is the whole lesson of coding agents in one experiment.

The skill that pays here is writing the contract before the agent runs, and standing up the evals that grade it. That is the work my team does daily, and if you want engineers who have shipped coding agents into real repositories rather than demos, you can hire AI engineers who have done it before. The deeper build path is in my walkthrough on how to build AI agents.

There is a second limit that the leaderboards hide. Coding agents are good at the change you can describe and verify. They are weak at the change you cannot. A refactor with a clear test is ideal. A vague ticket like "make checkout faster" is not, because the agent has no oracle to grade itself against. The skill that still pays is writing the spec tightly enough that the agent has something to optimize toward. The model writes the code. You write the contract it has to satisfy.

A coding agent works because the test suite is the evaluator. Strip away the tests and you are back to trusting a confident guess.

Research and browser agents: fast, useful, not yet trustworthy

Deep-research agents are the most seductive category and the one I trust least without review. Perplexity wraps up a research run in about three minutes; OpenAI Deep Research takes 7 to 20 minutes and is more reliable. Both produce a structured draft with citations, which is genuinely useful as a starting point.

The limit is accuracy you cannot see. These agents hallucinate sources and summarize claims the underlying page does not make. The error is invisible because the output looks authoritative and the citations look real. So the rule is simple: an AI research agent drafts; a human verifies every claim you will stand behind. Treat it as a fast intern, not a fact-checker.

The 2026 versions are more capable, which makes this harder, not easier. Perplexity now routes a research run across more than 20 models and can produce spreadsheets, dashboards, and slide decks directly from it. A polished deliverable raises your trust faster than the underlying accuracy justifies. That gap between presentation and reliability is exactly where a confident wrong answer slips through. The better the agent looks, the more disciplined your review has to be.

Browser and computer-use agents are even further from autonomy. On OSWorld, the standard real-computer benchmark, the human baseline sits at 72.36%, and the strongest agents only crossed it in late 2025 after starting near 7% a year earlier. That sounds like a finish line until you remember a roughly 28% failure rate on a multi-step UI task compounds fast across a flow. They earn their keep on known, repetitive screens. They break on the one they have never seen.

Customer-ops agents: a majority resolved, the rest escalated well

Customer-ops agents are where the revenue case is clearest. Intercom's Fin AI Agent reports an average 67% resolution rate across 7,000-plus customers and tens of millions of conversations, a vendor-published figure worth treating as a marketing claim until you pilot it yourself. Across thousands of production deployments, autonomous resolution consistently lands in the 55-70% band. Well-configured systems claim higher.

The number that matters for the P&L is not the resolution rate. It is the quality of the escalation. Roughly a third of conversations hand off to a human, and a human who receives full context (issue category, account state, steps already taken) resolves the ticket far faster than one starting cold. That is where the ROI lives: not in the tickets the agent closes, but in how cleanly it passes the ones it cannot.

One SaaS support lead I worked with shipped a Fin-style agent that resolved 61% of tickets in month one, and her team nearly killed it in week three. The agent was closing the easy half cleanly but dumping the hard half on agents with no summary, so handoffs took longer than before the agent existed. We rebuilt the escalation payload to carry the conversation summary and the last action attempted. Resolution barely moved, but median handle time on escalated tickets dropped by about a third. The agent did not get smarter. The handoff did.

The honest limit is that the failing third is where your brand risk concentrates. An agent that closes 67% of tickets and botches the handoff on the rest can cost more than it saves. So you instrument the escalation path first, then expand what the agent handles. This is the same evaluation discipline I argue for in the honest accounting of what agents can do today, and it is why teams that bolt on AI observability and monitoring from day one catch the botched handoff before a customer does.

The published ROI figures make the case worth taking seriously. Vendor benchmark data puts the average return near $3.50 per $1 invested, with a typical 3-to-6-month payback. Those numbers are real and also conditional. They assume a clean handoff and a safe set of starting workflows. Start the agent where mistakes are easy to detect and cheap to reverse, then widen the band as the evaluation data tells you it is safe. Lead with the hard cases and the same numbers turn negative.

Orchestration frameworks: how you wire the rest together

Frameworks are not agents. They are how you assemble agents, tools, and handoffs into something you can observe and roll back. In 2026 the field is crowded: OpenAI shipped an Agents SDK, Google launched ADK, Hugging Face released Smolagents, and LangGraph surpassed CrewAI in GitHub stars on the back of enterprise adoption.

The honest framing is a trade, not a ranking. LangGraph models your system as a directed graph with explicit checkpoints, which maps cleanly to audit trails and rollback points but costs you boilerplate. The OpenAI Agents SDK centers on the handoff: agents transfer control and carry context. CrewAI uses a role-based model and needs the least setup. More autonomy in any of them means more failure surface to instrument.

Pick the framework by what it costs you in control, observability, and lock-in, not by the feature list. I go deeper on that trade in my neutral comparison of agentic AI frameworks from production. The full design discipline behind reliable agents is the subject of my book, Agents That Actually Work.

How to choose the best AI agent for your task

The best AI agent is the one that fits the task you actually have, so I screen candidates with four questions before I trust any leaderboard. They cut across every category above.

Is the task bounded? Can you write the start and stop condition in one sentence? "Triage these tickets" is bounded. "Run support" is not.
Can you verify the output mechanically? A passing test, a validated schema, a resolved ticket. If the only check is a careful human read, autonomy will not scale.
Is a mistake cheap to reverse? A code change reverts. A sent refund does not. Reversibility decides how much rope the agent gets.
What does the escalation path look like? When the agent hits its limit, does the human inherit full context or a cold start? This is where the ROI quietly lives or dies.

A task that clears all four is in the band where an agent earns its keep. A task that fails two or more is a workflow, a single model call, or a job for later. The discipline that turns those questions into a repeatable check is an eval harness, which I cover in the guide to how to evaluate an AI agent on its trajectory. The full version of the framework, with production examples, is in my book Human in the Loop Is Not a Plan.

Where you still need a human

Across every category, the human moves to the same place: the end of the loop, evaluating output the agent cannot verify about itself. A coding agent cannot know its change broke a downstream contract. A research agent cannot know its citation is wrong. A support agent cannot know the angry customer needed empathy, not a refund policy.

That is the judgment economy in one paragraph. When agents make generation and action cheap, value migrates to whoever can tell good output from bad. The best AI agent platforms in 2026 are the ones that make that human evaluation fast, contextual, and cheap. The worst ones hide the failure until a customer finds it.

Frequently asked questions

What are the best AI agents in 2026?

The best AI agents in 2026 are coding agents (Claude Code, Codex CLI, Cursor), deep-research agents (Perplexity, OpenAI Deep Research), customer-ops agents (Intercom Fin, Ada), and orchestration frameworks (LangGraph, OpenAI Agents SDK, CrewAI). Each is strong in a narrow band, and none removes the need for human review.

What are the best AI agent tools for developers?

Coding agents are the strongest AI agent tools today because the work is bounded and testable. Claude Code leads on multi-file refactors in large repositories, Codex CLI tops the available models on Terminal-Bench, and Cursor is best for fast, in-editor edits. Benchmark scores do not transfer cleanly between harnesses, so validate on your own codebase.

Are AI agents reliable enough to run without humans?

No, not in 2026. Top customer-ops agents resolve 55-70% of tickets autonomously and escalate the rest. Browser agents still trail the human baseline on real tasks. The reliable pattern is the agent doing the work and a human evaluating the output, with the escalation path instrumented before you scale autonomy.

How do I choose the best AI agent platform?

Choose by the task you have, not by a leaderboard. For code, pick a coding agent with a strong harness. For support, pick a platform whose escalation handoff carries full context. For custom systems, pick an orchestration framework by what it costs you in observability and lock-in. The best AI agent platforms make human evaluation fast.

If you are deciding which agent to build, buy, or kill, the answer is rarely the model and almost always the loop around it. The full design discipline behind that loop is in my book Agents That Actually Work. And if you want a team that ships agents with evaluation and observability built in from day one, you can hire AI engineers who have done it in production. Bring the failing third with you. That is the part worth getting right.

Agentic RAG: When Your Agent Needs to Retrieve

Alpesh Nakrani — Mon, 08 Jun 2026 18:30:00 GMT

Agentic RAG is retrieval where the model decides when to search, what to search for, whether the result is good enough, and whether to search again. Static RAG retrieves once and hopes. Agentic retrieval turns the lookup into a loop: plan a query, fetch, judge the evidence, refine, repeat, then answer. It beats static RAG on multi-hop and ambiguous questions, and it buys that lift with real cost, latency, and a new failure surface.

So the rule is narrow. Use agentic RAG only where one-shot retrieval demonstrably fails. Not because the architecture is fashionable. Because you have a measured gap that an extra retrieval hop closes and a one-shot pipeline cannot.

I write this from two years of putting retrieval systems into production, and from the seat where the inference bill is a line item I have to defend. Most teams reach for agentic RAG too early, pay 5 to 10x the cost per query, and never measure whether the static pipeline was actually the thing that broke. If you want that measured-first discipline on a real workload, it is the core of how Devlyn builds RAG and knowledge integration: prove where one-shot retrieval fails before adding a single loop.

Static RAG retrieves once and hopes. Agentic RAG decides when to retrieve, judges what came back, and searches again. That loop is the whole product, and the whole bill.

Key takeaways

If you read nothing else, read these.

Agentic RAG makes retrieval a decision, not a step. The model controls when and what to retrieve, evaluates the evidence, and re-retrieves until it has enough.
The win is concentrated. Multi-hop and ambiguous queries see the largest gains in published benchmarks; single-fact lookups see little or none.
The cost is real. Expect several extra model calls, materially higher token spend, and seconds of added latency per answer.
It adds failure modes static RAG never had. Retrieval loops, query drift mid-run, and confident answers built on a bad self-judgment of "enough."
Decide with evals, not vibes. If your one-shot recall is fine on the queries that matter, agentic retrieval is a tax with no return.

What is agentic RAG?

Agentic RAG is retrieval-augmented generation where an agent governs the retrieval process instead of a fixed pipeline. In classic RAG, code embeds the query, pulls the top k chunks, and stuffs them into the prompt. The model never decides anything about retrieval. In agentic RAG, the model decides: it can rewrite the query, choose a source, call a tool, read what came back, and rule it insufficient. Then it goes again.

The line is the same one I draw for agents generally. If a developer wrote the retrieval path and it never changes, that is a workflow. If the model chooses the next retrieval action at runtime, that is agentic. I made the broader version of this argument in the pillar field guide on AI agents and agentic workflows. Retrieval is one of the cleaner places to add a thin agentic layer, because the action is bounded and mostly reversible: a search that returns nothing useful costs tokens, not a sent email.

Three behaviors define the loop. Query planning: the model decomposes a question into sub-queries it can actually answer. Iterative retrieval: it fetches, reasons over the result, and forms a sharper follow-up query. Self-evaluation: it judges whether the retrieved evidence supports an answer, and rejects its own draft if it does not. A recent survey of agentic RAG frames these as reflection, planning, and tool use layered onto the retrieval path.

Where agentic retrieval beats static RAG

The gain shows up on a specific shape of query: questions that need facts from more than one place, where you do not know the second place until you have seen the first. These are multi-hop queries, and they are exactly where a single retrieval pass fails.

Take a real-sounding support question: "Does the lens coating my customer ordered last month qualify for the new vision plan, given their carrier?" A one-shot retriever embeds that whole sentence and pulls chunks that are kind of about all of it and precisely about none of it. An agentic retriever splits the work: find the order, find the coating, find the carrier's current plan rules, then reason across the three. Each hop's answer shapes the next query.

The published numbers point the same direction, and I treat them as directional rather than promises. Across 2026 surveys and benchmark write-ups, agentic RAG reports roughly a third higher average accuracy than static RAG, with the largest lift on multi-hop questions and the smallest on single-fact lookups. On standard multi-hop sets like HotpotQA and 2WikiMultiHop, iterative agentic approaches post the top scores; a recent structured-reasoning study of multi-hop RAG reports its iterative method reaching state-of-the-art accuracy across these benchmarks. The pattern is consistent: the harder the cross-document reasoning, the more the loop earns.

The win is not "agentic RAG is better." It is "agentic RAG is better on multi-hop and ambiguous queries, and roughly even everywhere else." Know which kind of query you are buying for.

Ambiguous queries get a quieter benefit. When a question is vague, a self-evaluating retriever can notice the first results are scattered and reformulate before answering, instead of confidently summarizing the wrong chunk. Static RAG has no mechanism to notice it retrieved poorly. It just answers.

The cost, latency, and failure surface

Here is the part the architecture diagrams skip. Every loop is more model calls. A static query is one retrieval and one generation. An agentic query can be a planning call, two or three retrieval-and-reason cycles, a self-evaluation call, and a final answer. That is the difference between one inference and six, and the bill scales with it. Expect a 5 to 10x cost multiple and several seconds of added latency per answer for the heavier patterns. Re-retrieval strategies in the literature push tail latency toward 20 to 30 seconds, which is disqualifying for anything interactive.

Latency is a revenue decision, not a technical footnote. A 6-second answer in a live chat path changes abandonment, and abandonment changes conversion. The CRO in me has killed technically superior retrieval designs because the latency budget did not survive contact with a real funnel. The cheapest correct answer usually beats the most accurate slow one.

The loop also adds failure modes static RAG never had. The agent can spin, re-retrieving on a query it keeps failing to satisfy, burning budget with no exit. It can drift, letting an early bad chunk steer every later query off the real question. Worst, it can misjudge sufficiency, deciding the evidence is "enough" when it is thin, then answering with the full confidence of a system that believes it checked its work. A self-grader that grades itself generously is more dangerous than no grader, because it manufactures false assurance.

# agentic RAG trace, one user query, instrumented per hop

# config: max_hops=4, sufficiency_threshold=0.70

hop 1 query="lens coating order last month" recall@5=0.81 sufficiency=0.42 decision=retry

hop 2 query="acme premium coating sku eligibility" recall@5=0.74 sufficiency=0.61 decision=retry

hop 3 query="carrier vision plan 2026 coating rule" recall@5=0.88 sufficiency=0.79 decision=answer

# 3 hops, 6 model calls, 5.9s total, $0.041/query vs $0.004 static

# watch: if sufficiency plateaus < 0.70 across hops, agent spins to max_hops and answers anyway

That trace is illustrative, not a client log, but the shape is real. The honest risk is the last comment line. When the self-evaluation never clears the threshold, a naive loop exhausts max_hops and answers on weak evidence. You need an explicit "I could not find this" exit, or the loop quietly converts a retrieval failure into a confident wrong answer.

Why agentic RAG can hide a failing pipeline

Agentic RAG does not exempt you from the failure that kills static pipelines. It can hide it longer. The pattern is the one every production team learns the hard way: the demo retrieves perfectly, the corpus grows, the queries drift, and recall collapses quietly while generation quality papers over it. A capable model synthesizes a plausible answer from three mediocre chunks, so nobody notices recall bled from 0.84 to 0.61.

Agentic retrieval makes that worse in one specific way. A self-correcting loop is even better at papering over weak retrieval, because it will reformulate and re-search until it scrapes together something answerable. Your answer quality stays acceptable. Your per-hop recall is rotting underneath, and now you are paying 6x to mask it. The extra hops are buying you cover, not correctness.

The discipline is identical to static RAG, just instrumented per hop. You need a golden set of query-and-relevant-chunk pairs, recall@k measured weekly on each retrieval hop, and a chart that shows the decay before a customer does. I lay out that measurement loop in full in my guide to RAG evaluation. Without it, agentic RAG is a more expensive way to fail silently. The loop is not a substitute for retrieval evals. It raises the stakes on them.

This is also the point at which an external set of eyes earns its cost. If your retrieval layer is the bottleneck and you would rather instrument it correctly than discover the decay from a customer, that is exactly the work Devlyn ships on RAG and knowledge integration, with per-hop recall evals wired in from day one.

When to reach for it, and when not to

Reach for agentic RAG when one-shot retrieval demonstrably fails on the queries that matter. The signal is concrete: your eval set shows static recall is fine on single-fact queries and falls apart on the multi-hop ones, and those multi-hop queries are a meaningful share of real traffic. That is a measured gap an extra hop can close.

Do not reach for it when a cheaper fix is on the table. Often the static pipeline is failing for reasons a loop will not solve: bad chunking, a stale index, a query distribution you never tuned for. Fixing those is cheaper than 6x inference. The order of operations is: tune static retrieval, measure, and only add agentic hops where the residual failure is genuinely multi-hop.

Good fit: multi-hop questions, cross-document reasoning, queries needing live tool calls like SQL or an API lookup mid-answer.
Bad fit: single-fact lookups, latency-critical chat paths, anything where a one-shot pipeline already clears your recall bar.
Prerequisite: a retrieval eval harness and a "could not find it" exit, before you add a single loop.

This is the same decision logic I apply to any agent. Bound the autonomy, name what the loop is allowed to do, and verify the result mechanically. The fuller version lives in my honest accounting of what agents can do today, and the retrieval mechanics, including the embeddings layer underneath all of this, sit in my book on retrieval that survives production.

Frequently asked questions

What is the difference between RAG and agentic RAG?

Standard RAG retrieves once with a fixed pipeline and generates an answer. Agentic RAG lets the model control retrieval: it decides when to search, rewrites queries, judges whether the evidence is sufficient, and re-retrieves in a loop. RAG is a step. Agentic RAG is a decision the model makes at runtime.

Is agentic RAG always better than standard RAG?

No. It wins clearly on multi-hop and ambiguous queries and roughly ties on single-fact lookups, while costing several times more per query and adding seconds of latency. On simple retrieval it is a tax with no return. Use it only where your evals show one-shot retrieval failing.

How much does agentic RAG cost compared to static RAG?

Expect a 5 to 10x cost increase per query, because each answer triggers multiple model calls for planning, iterative retrieval, and self-evaluation instead of one. Latency rises with it, often by several seconds, and heavy re-retrieval can push tail latency toward 20 to 30 seconds.

How do I evaluate an agentic RAG system?

Measure retrieval per hop, not just the final answer. Build a golden set of query-and-relevant-chunk pairs, track recall@k on each retrieval hop weekly, and watch for the loop masking decayed recall with extra searches. Add an explicit "could not find it" exit so a failed loop does not answer confidently on thin evidence.

If you are deciding whether agentic retrieval is worth the cost on a real workload, that is exactly the kind of build a Devlyn's RAG and knowledge integration build ships with evals from day one. We measure where one-shot retrieval actually fails before we add a single loop, so you pay for the hops that earn their keep and kill the ones that do not.

Agentic Coding: What Changes When the Machine Writes Code

Alpesh Nakrani — Sun, 07 Jun 2026 18:30:00 GMT

Agentic coding is the AI-Native software development lifecycle in practice. The machine writes the implementation. The engineer specifies intent and evaluates the diff. Everything that changes about the job follows from that one swap: you stop authoring mechanism and start authoring constraint, then judging whether the generated code honors it.

I have run this loop in production long enough to be specific about what it does to an engineer's day, what it does to the P&L, and where it breaks. This is not the vendor version where you paste a prompt and ship the output. It is the harder, more durable version, and it has costs people are not pricing in. If you want a team that already runs it this way, you can hire AI engineers who have shipped agentic coding in production.

Agentic coding is not the machine helping you code. It is the machine doing the coding while your job contracts to intent and evaluation.

The reflexes that made you a good engineer in 2022 are now partly liabilities. Reading code for syntax. Writing the for-loop yourself because it is faster than explaining it. Trusting that working code is correct code. Each of those reflexes has to be rebuilt for a world where a model produces plausible code far faster than you can validate it.

Key takeaways

If you read nothing else, read these.

Agentic coding moves your job from writing to specifying and evaluating. The agent plans, writes, runs, and revises; you author intent and audit the diff against it.
The spec is the program, and the bottleneck. A coding agent fills every gap you leave with a plausible wrong answer, so bugs migrate into what you forgot to specify.
Validation, not generation, is the new constraint. Commit volume rises three to four times while review stays serial, so the queue backs up and the quality bar quietly drops.
Faster generation is not faster delivery. A 2025 METR trial found experienced developers were about 19% slower with AI tools while believing they were faster.
Evaluation has to be a funded role. "A senior engineer will skim it" is not a plan when the agent ships candidate bugs at machine speed.

What agentic coding actually means

Agentic coding is software development where an AI coding agent plans, writes, runs, and revises code across multiple steps, while a human specifies the intent and evaluates the result. The word that matters is agentic. The model is not autocompleting your next line. It is taking a goal, breaking it into steps, calling tools, reading errors, and iterating until it believes the task is done.

That is a different thing from AI-assisted coding, where you still drive and the model suggests. The distinction is the whole point of the AI-Native thesis: AI-Native means the machine does the job, not that it helps you do yours. Agentic coding is that thesis applied to the one domain where it has matured fastest, and it inherits the same reliability rules as the broader pattern I lay out in the guide to AI agents and agentic workflows.

The progress is real and worth stating plainly. On SWE-bench Verified, a benchmark of real GitHub issues, top coding agents now resolve well over 80% of tasks per the public SWE-bench leaderboard, up from a 1.96% baseline when the original SWE-bench paper landed in late 2023. The labs have already moved on to harder benchmarks because the old one is partly saturated.

The machine can write the code. The open question is whether you can evaluate it fast enough to ship it safely.

The spec is the program, and now it is the bottleneck

When the machine writes the implementation, the artifact you actually author is the specification. I have argued this at length in the spec is the program now: the thing you write before the model runs becomes the source of truth you version and defend. Agentic coding is where that stops being a thesis and starts being a daily constraint.

Here is the mechanism that surprises people. A coding agent will produce a plausible implementation for any gap you leave in the spec. It does not pause to ask; it fills in.

So the bugs migrate. They are no longer in the code the model wrote for things you specified. They are in the code the model wrote for things you forgot to specify.

That moves the hard work earlier. The creative act is now the precise statement of intent, including the edge cases and the explicit list of what not to build. A spec written for a model implementer has to be more exact than one written for a senior human, because the human brings context and the model brings priors.

# Intent (what the agent must implement)

Validate a webhook payload before it touches the queue.

# Constraints the agent must honor

-- Reject any payload without a verified HMAC signature.

-- On a malformed body, return 400 and log; never enqueue.

-- Signature check must run before JSON parsing.

# What NOT to build

-- Do not add a retry loop here; retries live in the consumer.

-- Do not log the raw payload body (it carries PII).

That last block, the explicit exclusions, is the part most people skip and most regressions come from. The model has strong opinions about what a webhook handler "usually" includes. If you do not say "do not log the body," it will helpfully log the body, and you will have shipped a privacy incident with a green test suite.

Evaluating the diff is the new core skill

The reflex that changes most is how you read a diff. You are no longer hunting for syntax errors. The model does not make those. You are auditing whether plausible-looking code actually does what the spec says, clause by clause. That is a spec audit, not a code review, and it is a harder cognitive task than most engineers expect.

This is the judgment economy showing up in your editor. When generation is cheap, the scarce, defensible skill is telling good output from bad. When doing is cheap, deciding is everything, and in agentic coding the deciding is reading machine-authored diffs with the right posture. The teams that win are not the ones generating the most code. They are the ones who can evaluate it fastest without lowering the bar.

The mechanical move that makes this tractable is invariant-first testing. Write tests against the behavior that must always hold, not against the mechanism. The model regenerates the mechanism constantly. If your tests are coupled to implementation, you rewrite them on every run. If they encode invariants, they survive regeneration and actually tell you something when they go red.

You are not reviewing whether the code looks correct. You are auditing whether plausible code does what the spec said, clause by clause.

Where agentic coding breaks

I will be blunt about the failure modes, because the honest accounting is the differentiator. Agentic coding does not break the way the demo suggests. It breaks downstream, where you are not looking.

Validation becomes the bottleneck. The agent writes code faster than your team can review it. AI-generated pull requests now sit in review far longer than human ones, and code review is already the slowest stage of most pipelines. You did not remove the constraint. You moved it from writing to reviewing, and review does not parallelize the way generation does.

This is the trap that catches most teams in their first quarter. The dashboard looks great and commit volume is up three to four times. Then the review queue backs up, reviewers start rubber-stamping to clear it, and the quality bar quietly drops at exactly the moment throughput went up.

A rubber-stamp review is worse than no review, because it launders unevaluated code as approved. The fix is not more reviewers. It is automated evals that catch the failure modes before a human ever opens the diff.

A staff engineer I worked with, Priya, ran the numbers on her team after a quarter on coding agents. Merged pull requests had roughly tripled, from about 40 a week to 120. The open-review backlog had also tripled, and the median PR now sat 31 hours in review instead of 9. The team felt slower, not faster, and they were right.

We stopped measuring throughput and built a pre-review eval gate: schema and invariant checks plus a security lint that blocked any diff carrying a known vulnerability class. Within three weeks the backlog cleared, because reviewers only opened diffs that had already passed the machine.

Security and quality regress quietly. Independent testing has found that a large share of AI-generated code introduces vulnerabilities. Veracode's 2025 GenAI code security evaluation of over 100 models found that about 45% of generated samples introduced an OWASP Top 10 flaw. The agent optimizes for "passes the test," not "is safe to run." If your evals do not cover the failure mode, the agent will sail right past it.

The fast lane can be slower. A 2025 METR randomized trial found that experienced open-source developers were about 19% slower with AI tools, while believing they were 20% faster. The perception gap is the dangerous part. Time saved generating gets spent, and then some, cleaning up. Agentic coding pays off in specific conditions, not universally.

Skill atrophy is a real second-order cost. When the machine writes the implementation, junior engineers lose the reps that used to build judgment. The two-track outcome is visible already: senior engineers capture most of the gains because they can evaluate output; newer engineers ship agent code they cannot yet judge. That is an organizational problem, and it lands later, because the pipeline that built senior judgment ran on exactly the reps the machine now absorbs.

The revenue lens: faster generation, slower validation

Here is the both-seats version, because the engineering decision is a P&L decision. Agentic coding lowers the cost of producing code toward zero. It does not lower the cost of being wrong. A privacy leak, a security CVE, a silent data-corruption bug each costs the same as it always did, and you are now producing candidate bugs at three to four times the rate.

So the unit economics flip. The expensive resource is no longer engineer-hours spent typing. It is senior judgment spent validating, plus the rework and incidents from whatever judgment you skipped. A team that treats agentic coding as "we ship more features now" without funding the validation layer is borrowing against its own reliability. The bill arrives as technical debt and production incidents, on a delay.

A fintech founder named Daniel learned this on a delay. His team shipped an agent-written reconciliation job that passed its tests and ran clean for six weeks, then silently mis-rounded a fraction of cents on about 2,000 transactions a day until a customer caught it. The generation had cost almost nothing; the cleanup, the audit, and the trust repair cost weeks.

The bug lived in a rounding rule nobody specified and no test asserted, exactly the gap a coding agent fills with a confident guess. After that, every money-touching diff went through an invariant test and a named reviewer, and the incident did not recur.

The teams getting real ROI do the unglamorous thing. They invest the saved generation time into evaluation, observability, and tighter specs, and they keep a senior person accountable for every diff that touches money, customer data, or auth. If you would rather have that observability built in from day one than bolted on after an incident, that is the work Devlyn does on AI observability and monitoring. It is the same discipline I argue for across all agent work in my honest accounting of what agents can do today, and at book length in Agents That Actually Work.

How to run the loop so it holds

The agentic coding loop that survives production is short and strict: specify intent, generate a diff, evaluate against the spec, tighten, repeat. Four practices keep it honest.

Write the spec to the model, not past it. State the invariants and the explicit exclusions. Assume the model fills every gap with a plausible wrong answer.
Bound the agent's blast radius. Least-privilege tools, no production write access, a human gate on anything irreversible. The same discipline that keeps agentic workflows safe applies inside the editor.
Test invariants, not mechanism. Your suite should stay green across regenerations and go red only when behavior actually breaks.
Make evaluation a funded role, not a hope. "A senior engineer will skim it" is not a plan at three to four times the throughput.

One more reflex worth naming: stop treating a long, fully autonomous agent run as the goal. A short loop with a human checkpoint between steps beats a long one that wanders, because the compounding error across many unsupervised steps is what produces the confident, plausible, wrong result. Bound the run, check the diff, then let it continue. Demos reward the long autonomous run. Production rewards the bounded one.

The full lifecycle version of this, from intake through incident response, runs on one core move that stays the same everywhere: author intent, generate mechanism, evaluate against intent. The deeper production version of that discipline is the argument of my book Agents That Actually Work. Master the loop and the specific coding agent you use becomes an implementation detail you can swap when a better one ships.

Frequently asked questions

What is agentic coding?

Agentic coding is software development where an AI coding agent plans, writes, runs, and revises code across multiple steps, while a human specifies the intent and evaluates the result. It differs from AI-assisted coding, where the human still drives and the model only suggests. In agentic coding the machine does the implementation; the engineer's job contracts to writing a precise spec and auditing the generated diff against it.

Is agentic coding the same as vibe coding?

No. Vibe coding usually means accepting model output without rigorous evaluation, which is exactly how you ship the 45% of AI-generated code that carries a known vulnerability class. Agentic coding done well is the opposite: a disciplined loop of explicit intent, bounded tools, invariant-based tests, and a senior human evaluating every diff that touches money, data, or auth.

Does agentic coding actually make engineers faster?

Sometimes, and not universally. Benchmarks show agents resolving most real GitHub issues, but a 2026 METR trial found experienced developers were roughly 19% slower with AI tools while believing they were faster. The gain is real for well-specified, well-bounded tasks and disappears when validation and cleanup eat the time you saved generating. Faster generation is not faster delivery unless your evaluation layer keeps up.

What skills matter most in agentic software development?

Writing precise specifications and evaluating machine-authored diffs. When the model writes the mechanism, value migrates to whoever can state intent exactly and tell correct output from plausible-but-wrong output. Reading diffs against a spec clause by clause, and encoding invariants as tests, are now core engineering skills rather than nice-to-haves.

If you are turning agentic coding into a system that has to hold under real load, with evaluation and security built in rather than bolted on, that is the work my team does. See how a Devlyn AI engineering team ships AI-Native software with specs, bounded agents, and evals from day one. The machine writes the code. Making it safe to ship is still the job.

Agentic AI Use Cases and the Constraint That Picks One

Alpesh Nakrani — Sat, 06 Jun 2026 18:30:00 GMT

The right agentic AI use cases are the ones where the task is repetitive, tool-bounded, and high-volume with a checkable outcome. Agents pay off there. They fail where the cost of a confident-wrong action is high and unbounded, because an agent will take that action with the same certainty it brings to a correct one. So the question is not "where could we use an agent." It is "which constraint does this task actually have." Match the use case to the business constraint, and the hype sorts itself out.

I spend my days where engineering meets revenue, currently as CRO at Devlyn, after a decade as a CTO and COO. From that seat I watch the same pattern repeat: a team picks an agentic AI use case because it looked impressive in a demo, not because the underlying task fit what agents are good at. Then the bill arrives: Gartner predicts over 40% of agentic AI projects will be canceled by the end of 2027, citing escalating costs, unclear business value, and inadequate risk controls. Most of those deaths are avoidable, and they start with picking the wrong use case for the wrong reason.

This guide is the test I run before committing engineering time to any agentic AI use case. It names the four properties that make a task a real fit, walks the use cases where the return is honest, marks the ones where ROI is theater, and turns all of it into a build-buy-kill call. If you want the pillar context first, this sits under my field guide to AI agents and agentic workflows, where the narrow-band argument starts.

Agents earn their keep where the task is repetitive, tool-bounded, and high-volume with a checkable outcome. Everywhere else, you are paying for autonomy you cannot afford to trust.

Key takeaways

If you read nothing else, take these five claims with you:

The constraint picks the use case. Repetitive, tool-bounded, high-volume, checkable outcome means an agent can pay off. The reverse means it will not.
Verifiability is the deciding factor. Coding works as an agentic use case because tests grade the output. Tasks with no cheap check are the dangerous ones.
The blast radius of a wrong action is the real cost. A confident-wrong action that is irreversible and expensive kills the ROI no matter how good the average case looks.
ROI is real in the boring middle. Support triage, data extraction, reconciliation, code changes behind tests. ROI is theater in the open-ended, judgment-heavy edge.
Build vs. buy follows the same constraint. Buy the commodity use cases, build the ones tied to your proprietary data and revenue, kill the ones that are demos in disguise.

The constraint that picks the right agentic AI use case

Start with four properties, not a list of industries. An agentic AI use case is a good fit when the task is repetitive, when it is bounded by a small set of well-defined tools, when it runs at high volume, and when the outcome is cheap to check. Hit all four and the economics work. Miss one, especially the last, and you are gambling.

Repetitive matters because agents amortize their setup cost over volume. You spend real engineering effort on tools, guardrails, and evals up front, so that cost only makes sense if the task runs thousands of times, not twice a quarter. Tool-bounded matters because every tool you expose is a thing the agent can do wrong. A small, validated tool set shrinks the surface where a hallucinated call becomes a real action.

The fourth property does the most work: a checkable outcome. Anthropic makes this point about coding in Building Effective Agents, where agents shine because solutions are verifiable through automated tests and the agent can iterate against that feedback. If you can grade the output cheaply, you can let the machine do the work and evaluate the result, which is the whole thesis. If grading the output costs as much as doing the task yourself, the agent saves you nothing.

If checking the answer costs as much as producing it, the agent has not removed the work. It has moved it to your most expensive reviewer.

Where ROI is real, framed by the constraint

The use cases that pay are unglamorous on purpose. They sit in the boring middle, where volume is high and a wrong answer is caught before it does damage. For a wider catalog of what is genuinely shipping, see the companion piece on agentic AI examples; here are the patterns I trust, each tied to the constraint that makes it work.

Customer support triage. High volume, bounded tools (read order history, pull a knowledge-base article, draft a reply), and a checkable outcome because a human or a confidence threshold gates anything irreversible. Anthropic describes this exact shape: classify the input, route it, give the agent read access and a draft action.
Data extraction and enrichment. Pulling structured fields from invoices, contracts, or tickets. Repetitive, high-volume, and checkable against a schema. A failed extraction fails loudly instead of acting on a fiction.
Code changes behind tests. The cleanest case, because the test suite is the check. The agent edits, runs the tests, and iterates. The outcome is verifiable without a human reading every line. More on this in what changes when the machine writes the code.
Reconciliation and routing. Matching transactions, flagging anomalies, routing work to the right queue. Bounded, repetitive, and the agent writes to a review queue rather than the ledger.

Notice the shared trait. In every case, the agent's worst output lands somewhere reversible: a draft, a flag, a staging branch, a queue. That is not an accident, it is the design choice that turns a risky autonomous system into a use case you can put a number on. The enterprise agentic AI programs that survive are the ones built this way, where autonomy is real but the blast radius is small.

A logistics team I advised, run by an ops lead named Priya, wanted an agent to handle inbound delivery exceptions: read the carrier message, classify the issue, and either draft a customer update or flag it. The first version tried to resolve everything, including refunds and reroutes. It looked impressive in the demo, then issued a wrong reroute in its first live week.

We cut it to one bounded job: classify the exception and draft the update, with anything touching money or a reroute routed to a human queue. That version cleared roughly 60% of the daily exception volume on its own, the part that was genuinely repetitive, and left the judgment calls to people. The numbers here are illustrative of the shape, not a published figure, but the lesson is exact: the use case was never the whole task, only the verifiable slice of it.

If you want the agent to handle that slice safely at volume, the missing piece is almost always the check, not the model. You need to know, in production, when a classification drifts or a draft starts going wrong before a customer sees it. That visibility is the work Devlyn builds in on AI observability and monitoring, so the blast radius stays small as the volume grows.

Where ROI is theater

The expensive failures look the opposite. They are low-volume, judgment-heavy tasks where the cost of a confident-wrong action is high and hard to reverse. An agent that autonomously sends customer emails, moves money, makes hiring calls, or commits to production without a gate is not a use case. It is a liability with a good demo.

The reason is structural, not a tuning problem. An agent does not know when it is wrong; it produces a bad action with the same fluency as a good one. When the outcome is cheap to check, that does not matter much because you catch the bad one. When the outcome is expensive to check and the action is irreversible, a single confident mistake can erase a quarter of savings.

This is why "a human reviews it" is not a plan. At high volume the reviewer becomes a bottleneck, then a rubber stamp, then the thing that failed. I worked the math on that bottleneck in my book Human in the Loop Is Not a Plan.

The data backs the caution. As of early 2026, industry reporting put the share of enterprise agent pilots reaching production at scale in the low double digits, with the rest stalling on unclear success criteria, weak tool and data access, and eval coverage that drifts. The common thread is not bad models. It is good models pointed at tasks that never had a checkable outcome in the first place.

A worked example: the same task, two constraints

Take one task and change a single property. The lesson lives in that change.

# Use case A: agent drafts refund replies, human approves over $50

action = agent.draft_refund(ticket)

if action.amount <= 50 and action.confidence > 0.9:

auto_send(action) # reversible, low blast radius

else:

queue_for_human(action) # expensive case gated

This works because the outcome is checkable and the blast radius is capped. Small refunds are reversible and high-confidence, so the agent runs free, while large or low-confidence cases route to a person. The agent handles the volume; the human handles the judgment. Now change one property and watch the ROI invert.

# Use case B: same agent, no cap, no gate, "to move faster"

action = agent.draft_refund(ticket)

auto_send(action) # unbounded amount, no confidence gate

# one confident-wrong $9,000 refund erases a month of savings

Same model, same prompt, same accuracy. The only difference is the constraint: Version A is a use case, and Version B is the headline that gets the project canceled. The decision that mattered was never the model; it was where you drew the line the agent is not allowed to cross. That principle generalizes; I cover it more fully in an honest accounting of what agents can do today.

Build, buy, or kill: the constraint decides that too

Once you can read the constraint, the spend decision gets easier, and this is where the engineering call becomes a revenue call. The same four properties that pick a use case also tell you whether to build it, buy it, or kill it.

Buy the commodity cases. Support triage, meeting notes, generic document extraction. These are bounded, well-understood, and someone already sells a mature agent for them. Building your own is paying engineers to reinvent a category.
Build the ones tied to your data and revenue. If the use case runs on proprietary data, touches your core workflow, and the quality of the output moves a revenue number, that is yours to build. The moat is the data and the evals, not the agent loop.
Kill the demos in disguise. Low-volume, no checkable outcome, high blast radius. If it only ever impressed people in a meeting, it will not survive contact with a customer. Killing it early is the highest-ROI decision on the list.

The honest trade-off is that buying gives up control and visibility into failure modes, while building costs you the up-front spend on tools, guardrails, and an eval harness before you see a dollar back. There is no free version, so the point of the constraint is to spend that effort only where the task can actually return it. When the use case is real and tied to revenue, a build is worth it, and that is exactly the work Devlyn's engineers ship, with evals from day one. If you have a use case that clears the bar, hire AI engineers who have built agentic systems in production rather than learning the failure modes on your dime.

Frequently asked questions

What are the best agentic AI use cases for enterprises?

The best enterprise agentic AI use cases are repetitive, high-volume tasks with bounded tools and a cheap way to check the result: customer support triage, data extraction, code changes behind a test suite, and reconciliation or routing. These work because the agent's worst output lands somewhere reversible, like a draft or a review queue, instead of a customer or a ledger.

Where do AI agents fail?

AI agents fail on low-volume, judgment-heavy tasks where a confident-wrong action is expensive and hard to reverse. An agent produces a bad action as fluently as a good one and does not know the difference. Without a cheap outcome check and a capped blast radius, a single confident mistake can erase the savings from thousands of correct runs.

How do I decide where to use AI agents versus buy a tool?

Read the constraint first. Buy the commodity, well-understood use cases where a mature agent already exists. Build the ones that run on your proprietary data and move a revenue number, because the moat is the data and the evals. Kill anything that is low-volume with no checkable outcome and a high blast radius.

How is an agentic AI use case different from a generative AI use case?

A generative use case produces output you read; an agentic use case takes actions in the world. Actions carry consequences that generation does not, which is why verifiability and blast radius matter so much more for agents. I draw the full distinction in what is genuinely shipping and in the field guide to agentic workflows.

Pick the constraint before the use case

Every durable agentic AI use case I trust passes the same test: repetitive, tool-bounded, high-volume, with an outcome you can check cheaply and a wrong answer that lands somewhere reversible. That is the constraint that separates real ROI from theater, so pick it before you pick the use case, and you will avoid most of the failures that get projects canceled. If you want the deeper version of this argument, the patterns and the failure modes, I wrote the book Agents That Actually Work. When you have a use case that clears the bar and ties to revenue, that is the moment a build pays for itself.

How to Build an LLM Evaluation Framework

Alpesh Nakrani — Fri, 05 Jun 2026 18:30:00 GMT

A good LLM evaluation framework tests what will actually break in production. That means four parts working together: a golden set sampled from real traffic, task-specific metrics instead of generic ones, blinded human rubrics, and a measurement cadence that catches drift before a customer does. Everything else is decoration.

I have watched teams ship LLM features behind an eval suite that reported ninety-percent accuracy and a green CI badge, then field a Sev-1 the same week. The suite was not wrong. It was measuring the wrong distribution. An LLM evaluation framework is not a dashboard you bolt on at the end. It is the harness that decides whether a model candidate earns a production deploy, and it has to be built with the same care as the feature it gates.

This is the build process I trust. It sits under my complete guide to LLM evaluation and extends the argument in my essay on evals that predict production, which covers the sampling problem in depth. Here I focus on assembly: what the framework must contain, and how to stand each piece up.

Key takeaways

An LLM evaluation framework is the gate between a model candidate and a deploy, not a dashboard you add at the end.
It has five parts: a frozen golden set from real traffic, task-specific metrics, an offline CI layer, an online layer, and a calibrated scoring engine behind a fixed gate.
Sample the golden set from production and oversample the hard tail; a set built from imagined cases passes right up until launch.
An automated judge handles volume only after it hits 85 to 90 percent agreement with a blinded human rubric; below that it is guessing.
Score components, not just the end-to-end answer. A strong model papers over retrieval collapse until the day it can't, and the gate is cheaper than the churn.

An eval framework is not a dashboard you bolt on at the end. It is the gate between a model candidate and a deploy.

What an LLM evaluation framework must contain

Most eval content lists metrics. That is the part that matters least. A framework is the system around the metrics, and a usable one has five components. Skip any of them and the number you report stops meaning anything.

A golden set sampled from real production traffic, frozen and versioned, never grown organically.
Task-specific metrics chosen for the failure modes that hurt this feature, not a generic accuracy score.
An offline layer that runs in CI and gates the deploy, plus an online layer that scores live traffic with the same metrics.
A scoring engine: an automated judge for volume, calibrated against a blinded human rubric for ground truth.
A cadence and a gate: a fixed threshold agreed before the run, and a refresh schedule that catches drift.

The pattern is now standard across teams that ship LLM features for real. The recommended production setup moves a single golden dataset through local development, a pre-merge gate, a deploy gate, and live monitoring, with the same evaluator at every stage so pre-launch and post-launch scores compare directly (Datadog, LLM evaluation framework best practices). One ruler, four checkpoints.

Building the golden set from real traffic

The golden set is the foundation, and it is where most LLM evaluation frameworks go wrong on day one. Teams build the set bottom-up: a developer writes cases while building the feature, a PM adds a few during review, and the set accumulates. The result reflects what the team imagined users would do, not what users do. Those two distributions diverge fast.

Sample the set from production instead. Pull a stratified slice of real requests, freeze it as a named artifact, and version it like code. A practical size is 200 to 500 cases that cover the feature's full operational envelope, each pairing an exact input with a reference output (Maxim, golden dataset guide). Build it from real failures, not synthetic happy-path examples.

Over-weight the hard tail on purpose. The cases that break production are not randomly distributed; they cluster. I deliberately oversample four buckets: bottom-quartile model confidence, cases a human reviewer corrected, syntactically adversarial input, and anything that previously caused an incident. That last bucket is the one teams skip because the incident feels resolved. It is not resolved until a future model passes those exact cases on a held-out set.

Freezing the set has a consequence worth stating plainly: your score on an older frozen set can only go down. There is no sneaking in easy cases to lift the number. That is the point. You want a fixed ruler, not a rubber band. The trade-off is honest: oversampling the tail makes your aggregate accuracy look worse than a uniform sample would. Ship the model that passes the hard set, not the one with the prettiest average.

Choosing task-specific metrics

The metrics are the part of the framework you tune to the feature, and a generic accuracy score is almost always the wrong choice. The question is not "is the output good." It is "what does broken look like for this task, and what number goes negative when it happens." Answer that first, then pick the metric.

Different tasks fail differently, so they need different instruments. A few that map cleanly to common LLM features:

Retrieval (RAG): recall and precision on the retrieved context, scored separately from answer quality. Faithfulness, whether the answer is grounded in what was retrieved, catches confident hallucination.
Classification or extraction: per-class precision and recall, because aggregate accuracy hides a class that fails completely while the average stays high.
Agents: per-step success and tool-call correctness, not just whether the final answer landed. A lucky end result over a broken trajectory is not success.
Open-ended generation: a rubric scored by a calibrated judge, plus a hard safety and refusal check that blocks regardless of quality.

Pick the two or three metrics that map to real failure, set a floor for each, and resist the urge to track twenty. A framework with too many metrics produces a dashboard nobody reads and no clear gate. The discipline is choosing the few numbers you will actually block a deploy on. I cover the full menu in the metrics that matter and the ones that lie. Teams running this at scale, like Booking.com's engineering org, report the same lesson: a few production-anchored metrics beat a long dashboard nobody gates on.

The offline and online layers

An LLM evaluation framework needs two layers, because a model can pass every offline check and still rot in production. Offline catches regressions before launch. Online catches drift after it.

The offline layer runs against the frozen golden set, in CI, on every model candidate. It produces a go/no-go number for the deploy gate. This is the layer that answers "is this candidate better or worse than what we shipped six months ago, on exactly the same questions." Most teams cannot say that sentence with confidence. The frozen set is what lets you.

The online layer scores live traffic with the same metrics, then watches for drift on top of those scores. Sample 5 to 10 percent of real requests, score them with your automated evaluator, and alert when the distribution shifts (OpenObserve, LLM monitoring best practices). Use statistical monitoring for input distribution and semantic drift detection for output meaning. You usually want both.

Here is why two layers and not one. End-to-end answer quality hides component failure. A RAG answer can read well while retrieval has quietly collapsed, because a capable model papers over thin context. If you only score the final answer, recall can fall for weeks before the answer quality metric notices. I have watched retrieval decay in month three while the surface metric stayed flat. Score the components, not just the output.

End-to-end answer quality hides retrieval failure. A strong model papers over thin context until the day it can't.

Automated judge versus human rubric

You cannot put a human on every output; that does not scale and turns the reviewer into a bottleneck, then a rubber stamp. You also cannot trust an automated judge you have not validated. The framework needs both, in the right roles: the automated judge handles volume, the human rubric defines ground truth and keeps the judge honest.

LLM-as-a-judge is the standard way to score at volume, but it is not free of failure modes. Research has documented position bias, where the judge favors a response based on where it sits in the prompt, and self-inconsistency across repeated runs (Judging the Judges, arXiv 2406.07791). Treat the judge as an instrument that needs calibration, not as a source of truth.

Calibration is the step that makes a judge usable. Validate it against a human-labeled slice of the golden set, and only trust it for the metrics where it hits 85 to 90 percent agreement with your raters. Below that, the judge is guessing and you do not know it. Re-run the calibration whenever you change the judge model or the rubric.

For the human rubric, blind it. Strip the model version, shuffle output order, and plant a known proportion of reference human responses into the batch unlabeled. If raters start scoring the planted human answers below the bar they apply to model output, the rubric has drifted and you stop before scoring anything else. I treat rater disagreement on a cluster as a rubric signal, not noise. The full blinding protocol and the cases where rubrics quietly collapse are in A Field Guide to Evals.

The honest trade-off: blinded human scoring is slow and expensive, and it does not scale to every release. That is exactly why "a human reviews it" is not a quality system on its own. The human calibrates the rubric and the judge; the judge does the volume. I make that argument in full in Eval-Driven Development, and the staffing version of it in why a human in the loop is not a plan.

The cadence and the gate

A framework without a gate is a report nobody acts on. The gate is a fixed threshold, agreed by product and engineering before the run, not after. Setting the threshold once you see the result is not evaluation; it is rationalization.

Here is an abbreviated output from an eval runner against a frozen set. The numbers are realistic, not from a specific live system.

# offline eval against frozen golden set

python -m eval.runner \

--suite golden-set-2026-w24-v2.jsonl \

--model prod-candidate-2026-06-15 \

--judge judge-calibrated-v4 \

--rater-pool senior-3

# results summary

cases evaluated 412

retrieval recall@5 0.78 # floor 0.80, FAIL

answer faithfulness 0.94 # threshold 0.90, pass

judge vs human agree 0.88 # floor 0.85, judge trusted

human disagree 5.9% # threshold 8.0%, pass

adversarial tail 16.7% # threshold 18.0%, pass

p95 latency 1,910 ms # +130 ms vs baseline, flag

verdict GATE BLOCKED # recall@5 below floor

Read that output and notice what the framework caught. The end-to-end answer faithfulness passed at 0.94. If that were the only metric, this candidate ships. But retrieval recall@5 fell below its floor, so the gate blocks the deploy. The strong model was masking a retrieval regression that a single answer-quality score would have hidden until production. That is the component-scoring discipline earning its keep.

Cadence is the second half. A frozen set decays as the world changes, so refresh it on a schedule, version each refresh, and run the old and new set in parallel for an overlap period to compare. The online layer's drift alerts tell you when to refresh early. This is where the framework connects to revenue: in an LLM product, a bad launch is not one incident, it is the erosion of the trust that makes the product sellable at all. The gate is cheaper than the churn. If you want that gate run as managed infrastructure, that is the core of Devlyn's AI observability and monitoring work.

FAQ

What should an LLM evaluation framework actually measure?

It should measure the failure modes that will break this specific feature in production, not a generic accuracy score. For RAG, score retrieval recall separately from answer quality. For agents, score each step and tool call, not just the final result. The anchor metric I report to leadership is model-versus-trusted-human disagreement on a frozen, production-sampled set, tracked over time, because it is anchored to a fixed distribution and to human performance.

How do I build a golden set for LLM evaluation?

Sample it from real production traffic rather than writing synthetic cases. Take a stratified slice of 200 to 500 real requests, oversample the hard tail (low confidence, human corrections, adversarial input, past incidents), pair each input with a reference output, then freeze and version the set. Never grow it organically. Cut a new version when you need new cases.

Is LLM-as-a-judge reliable enough to gate a deploy?

Only after you calibrate it. Validate the judge against a human-labeled slice and trust it only for metrics where it hits 85 to 90 percent agreement with your raters. Documented biases like position bias and self-inconsistency mean an uncalibrated judge can be confidently wrong. Use the judge for volume and a blinded human rubric for ground truth.

What is the difference between offline and online LLM evaluation?

Offline evaluation runs against a frozen golden set in CI and gates the deploy before launch. Online evaluation scores a sample of live traffic with the same metrics and watches for drift after launch. You need both: offline catches regressions, online catches the distribution shift that no frozen set predicts forever.

If you are standing up your first LLM evaluation framework, start with the golden set and the offline gate; those two pieces catch most of what kills launches. When you are ready to make evaluation the way your team works rather than a step at the end, Eval-Driven Development is the long-form version of this harness. Build the framework that predicts production, freeze it, and trust the number it gives you over the one you wished for.

AI-Native means the machine does the job

Alpesh Nakrani — Thu, 04 Jun 2026 18:30:00 GMT

For three years we called everything "AI-assisted" and that framing let us stay comfortable. Autocomplete graduated to drafts; drafts became first passes; first passes started shipping. At every step we kept a hand on the wheel and told ourselves the human was still doing the work. That story is over. The question now is whether you have noticed.

AI-Native is a harder claim than AI-assisted, and it is a different one. The machine does the whole job. Not most of it. Not a draft you clean up. The whole job, end to end, with the human's role contracting to a single surface: judgment. Specifying what good looks like before the machine starts. Evaluating whether the output meets it when the machine stops. Owning the call either way. That is it. That is the entire human contribution in an AI-Native workflow, and it is a smaller surface than most teams are used to occupying.

I have spent the last two years living inside that shift, first as CTO building systems on it, now as CRO at Devlyn, where we build AI-Native engineering teams as a service. I have watched the definition get softened in every direction: by vendors who want to sell to teams not ready to change, by leaders who want the story without the restructuring, by engineers who want the credit without the accountability. The softening is costly. Let me explain why the sharp definition matters and what it actually asks of you.

Key takeaways

AI-Native is not AI-assisted. AI-assisted means a human drives and the machine helps. AI-Native means the machine owns a complete unit of work and the human owns the judgment around it.
The human role contracts to one surface: judgment. Specify what good looks like before the machine starts, evaluate whether the output meets it when the machine stops, and own the call either way.
The spec becomes the artifact. When the output is wrong, you fix the spec that produced it and regenerate, rather than reaching for the keyboard to patch the code yourself.
When generation is cheap, evaluation is scarce. The teams that win are not the ones that generate faster; they are the ones that evaluate better, which is why this work needs senior judgment, not hidden juniors.
Name what you will never delegate. Every AI-Native system needs an explicit list of decisions a human owns unconditionally, with a named human accountable for each.

The soft definition is a budget leak

The comfortable version of AI-Native looks like this: a developer uses Copilot or Cursor, writes maybe thirty percent of the code by hand, reviews and accepts the rest. Velocity goes up. The team says it is AI-Native. Leadership puts it in the deck. Nothing structural changes.

That is AI-assisted. It is useful. It is not AI-Native. And the difference is not semantic. It is organizational. When a human is still driving, still deciding each turn, still authoring intent line by line even if the syntax is generated, you need the same headcount, the same supervision layers, the same review bandwidth. You have bought a faster horse. The margin improvement is real but bounded.

AI-Native means the machine owns a complete unit of work: a feature, a test suite, a documentation pass, a code review, a customer-facing summary. The human defines the unit and evaluates the result. They do not execute the steps in between. That change in the locus of execution is what makes the organizational math different. The cost per unit of output falls not by thirty percent but by an order of magnitude, because you stop paying for the execution time of a human on every loop. You pay instead for the judgment calls at the boundary: was the spec clear enough? Did the output meet it? What do we do when it did not?

AI-Native is not "AI did most of it and I checked." It is "the machine owns the doing; I own the deciding." Those are different companies, and the gap between them shows up in your org chart before it shows up in your P&L.

The teams I see struggling most are the ones who adopted AI tooling without adopting the corresponding shift in how they staff and supervise. They end up with a novel bottleneck: highly paid senior engineers spending their days reviewing machine-generated output that a less experienced engineer would have caught faster, because the experienced engineer was not hired to review, they were hired to build. The machine changed what needed building. The org never updated.

What the sharp definition actually means

Here is what I mean when I say the machine does the whole job. A product engineer writes a spec: user intent, acceptance criteria, edge cases, integration constraints. The spec goes to a model. The model produces an implementation. The engineer reads the diff, runs the eval suite, and makes one of three calls: ship it, send it back with corrections, or reject it and rewrite the spec. That is the loop. The engineer never touches the implementation code except to read it.

That sounds extreme. It is not. The spec becomes the artifact you maintain, not the code. When the code is wrong, you do not fix the code, you fix the spec that produced the wrong code, then regenerate. This inverts a reflex that most engineers have built over a decade: the reflex to reach for the keyboard and fix it yourself. Breaking that reflex is the single hardest cultural change AI-Native requires, and it is genuinely hard. It feels irresponsible the first twenty times. Then you watch the model re-generate a corrected implementation in forty seconds and you stop feeling irresponsible. You start feeling like you finally understand where your time belongs.

The judgment that remains is not shallow. Specifying intent precisely enough for a machine to act on is a skill. Most specs are not good enough. They are ambiguous about the edge cases that matter, silent on the failure modes the author did not imagine, confident about requirements that were never verified with the customer. Writing a machine-executable spec is closer to writing a proof than writing a ticket. It demands that you understand the problem completely before anyone touches the keyboard, which, it turns out, is how good engineering was supposed to work before we normalized figuring it out as we went.

What contracts, and what expands

The doing contracts. Writing the implementation, building the first draft, running the routine path, that work moves to the model, and the marginal cost of it falls toward zero. This is not a prediction. It is already true for code, for test generation, for documentation, for code review commentary, for architecture diagrams, for incident summaries. The question is not whether the machine can do it. The question is whether your team is structured to take the handoff.

The deciding expands. When generation is cheap, the scarce input becomes the ability to tell good output from bad, quickly, at scale, across cases the model has never seen. That is taste. It is domain knowledge. It is a measurement discipline, evals, human review rubrics, production monitoring, that most teams have not built yet because they never needed it when a human was executing every step. Taste and measurement do not get automated away. They become more valuable as generation becomes cheaper, because the ratio of output to evaluation capacity tips toward evaluation. You can generate more than you can confidently review, unless you build the review infrastructure deliberately.

Senior engineers are the correct resource for that review, not because they are expensive and therefore prestigious, but because evaluation requires the full context of the system: the architectural constraints, the production history, the customer expectations, the edge cases that only appear at scale. A junior engineer cannot catch what they have not yet learned to look for. This is why the model we run at Devlyn is senior engineers only. No juniors hidden behind AI. Not because we do not believe in developing talent, but because AI-Native work requires a reviewer whose judgment I would defend in a room full of skeptics. That bar is not about years of experience. It is about whether the person reading that diff truly understands what it touches and what could go wrong.

When generation is cheap, evaluation becomes the scarce resource. The teams that win are not the ones that generate faster. They are the ones that evaluate better.

What also expands: the importance of knowing what you will never delegate. Every AI-Native system has a set of decisions that belong to a human unconditionally. Not because the machine cannot generate an answer, it can always generate an answer, but because the organization is not willing to own the consequences of a wrong machine answer in that domain. Security decisions. Decisions about customer data boundaries. Architectural choices that will be load-bearing for five years. Calls that require regulatory accountability. The moment you are AI-Native, you need an explicit list of what is not. Most teams skip that list and discover the omission after an incident.

What this asks of a team

Three things, in order, and skipping the second one because the first one went well is how teams get into trouble.

First: specify intent precisely enough that a machine can act on it. This is harder than it sounds. A spec that would work fine as a Jira ticket for a developer will fail as a prompt for an agent. It is missing the implicit knowledge the developer would have brought: the codebase conventions, the failure patterns they have seen before, the stakeholder preferences the ticket author assumed were obvious. Making that knowledge explicit is work. It requires the person writing the spec to know the system well and to anticipate what the model will not know to ask. The spec is the program. Treat it with the same rigor you would bring to the code.

Second: build evaluation that keeps pace with autonomy. "A human reviews it" stops scaling the moment the machine outruns the reviewer, which happens faster than teams expect once they are generating at machine speed. You need automated evals that catch the failure modes you know about, human review rubrics that catch the ones you do not, and a measurement cadence that surfaces drift before it becomes an incident. The hardest part of this is that good evals require you to know what failure looks like before it happens, which requires domain expertise and production experience. This is not a task you can delegate to the tool that is generating the output.

Third: name the decisions you will never delegate, and staff them accordingly. This is an explicit list, maintained by a person with authority, reviewed on a schedule. Not a vague principle but a named set of decision types with a named human accountable for each one. When that list does not exist, the default is that everything gets delegated eventually, because the pressure to go faster is constant and the machine is always available. The list is the only thing that holds the boundary.

Do all three, and AI-Native becomes an operating model rather than a marketing claim. Skip any one of them, and you have bought a faster way to ship work nobody is qualified to evaluate.

The cost of overselling it

I want to say something about timelines, because I hear a lot of promises in this space that I do not believe, and I think the damage from those promises is underpriced.

AI-Native does not mean instant. It does not mean you can skip architecture. It does not mean a team of two can build what used to take twenty, in a quarter of the time, at the same quality. Sometimes those claims are approximately true in narrow circumstances. More often they are fantasy, told by people who are selling the future as if it were the present and hoping the customer does not notice until after the contract is signed.

The position I hold, and the position I hold my team to: we will not oversell AI. We will not promise fantasy timelines. We will not trade quality for speed and call the difference AI leverage. Ownership over hours. Outcomes over velocity. That is not a conservative stance on AI, I believe deeply in what these systems can do. It is a conservative stance on honesty, which I think is the only sustainable basis for a client relationship in a category that has already spent a lot of its credibility early.

AI-Native engineering at its best is senior engineers who understand what the machine can own, who write specs the machine can execute, who evaluate outputs with the rigor those outputs deserve, and who are accountable for the result in production. That is a high bar. It is supposed to be. The machine took the work and left us the judgment, but judgment is not a consolation prize. It is the whole game, and it is harder to hire for, harder to develop, and harder to fake than the execution it replaced.

Where the judgment economy begins

I wrote a book called The Judgment Economy because I think this shift has an economic structure that most people are not tracking yet. When execution is cheap, the market value of execution falls. When evaluation is scarce, the market value of evaluation rises. That is not a soft observation about the future of work, it is a pricing signal that is already appearing in how the best engineering teams are staffed and compensated.

The engineers who are most valuable in an AI-Native environment are not the ones who generate the most code. They are the ones who can tell, quickly and reliably, whether generated code is correct, in the full sense: correct for the use case, correct for the scale, correct for the security model, correct for the production environment, correct for the customer's actual expectation versus their stated one. That evaluation capability is not evenly distributed. It accumulates with experience in a specific domain. It is not easily transferred between contexts. It is not replicable by a model, because the model is the thing being evaluated.

This is what I mean when I say judgment is the whole game. Not that AI cannot do remarkable things, it can, and it will do more. But the claim that AI makes judgment obsolete is exactly backwards. When execution is abundant, judgment is what remains scarce. Scarce things are valuable things. The question is whether you are building the organizational infrastructure to develop, apply, and defend that judgment, or whether you are hoping the tool will handle it and calling that AI-Native.

The definition matters because the soft one is expensive. Not eventually. Now. In the org you are building, the people you are hiring, the contracts you are signing, and the clients you are making promises to. Get sharp on what AI-Native means and the rest of the decisions become clearer. Stay soft on it and you will keep paying for both the machine and the human to do the same job, and wondering why the economics never quite work out.

Frequently asked questions

What is AI-Native?

AI-Native means the machine does the whole job, end to end, while the human's role contracts to a single surface: judgment. The human specifies what good looks like before the machine starts, evaluates whether the output meets that bar when the machine stops, and owns the call either way. It is distinct from AI-assisted, where a human still drives and the machine only helps.

What is the difference between AI-Native and AI-assisted?

In an AI-assisted workflow a human is still driving, still authoring intent line by line, still executing the steps even if the syntax is generated, so you need the same headcount and supervision. In an AI-Native workflow the machine owns a complete unit of work and the human only defines the unit and evaluates the result. The difference is organizational, not semantic, and it shows up in the org chart before it shows up in the P&L.

Does AI-Native make engineering judgment obsolete?

No, the opposite. When execution is abundant, judgment is what stays scarce, and scarce things are valuable. The engineers who matter most in an AI-Native environment are the ones who can tell quickly and reliably whether generated output is correct for the use case, the scale, the security model, and the customer's actual expectation. That evaluation capability accumulates with domain experience and cannot be replicated by the model, because the model is the thing being evaluated.

AI Agents and Agentic Workflows: An Honest Field Guide

Alpesh Nakrani — Wed, 03 Jun 2026 18:30:00 GMT

An AI agent is a system where a language model is given a goal, a set of tools, and the freedom to decide its own next action based on what it observes. An agentic workflow is that loop running toward an outcome: the model plans, calls a tool, reads the result, adjusts, and repeats until it stops. The difference from a chatbot is simple. A chatbot answers. An agent acts.

That distinction is the whole story. Generation produces text you read. Agentic workflows produce actions with consequences. And actions are where the trouble lives. I have spent two years putting these systems into production, and the honest truth is this: agents earn their keep in a narrow band of tasks, and that band is smaller than the demos suggest.

This is the field guide I wish someone had handed me before I shipped my first agent. It defines the terms, draws the line between agentic and generative AI, names the failure modes nobody markets, and walks the design patterns that actually hold at 3am. I write it from the seat where engineering meets revenue, because the question "should this be an agent?" is always both a technical and a P&L question.

A chatbot answers. An agent acts. Everything hard about agentic workflows comes from that one word: act.

Key takeaways

If you read nothing else, read these.

Agentic workflows are loops, not prompts. The model decides its next action, calls tools, and iterates toward a goal. Autonomy is the point and the problem.
Reliability compounds downward. A 1% per-step error rate becomes a 63% chance of failure across 100 steps. Chain five agents at 95% each and end-to-end success drops to about 77%.
The narrow band is bounded, reversible, verifiable, and tool-scoped. Remove any one property and the agent fails quietly.
"A human reviews it" is not a plan. You evaluate an agentic workflow with an eval harness on the trajectory, not just the final answer.
Agents trade latency and cost for autonomy. Sometimes that trade pays. Often a single well-prompted model call is cheaper and more reliable.

What are AI agents and agentic workflows?

An AI agent is software that pursues a goal by taking actions in a loop, using a language model to decide each step. Agentic workflows are the orchestrated runs of that loop across tools, data, and time. The model is the reasoning engine. The tools are how it touches the world: APIs, databases, file systems, other models.

The word "agent" has been stretched past meaning by marketing. So I use a strict test. If the system follows a fixed code path that a developer wrote, it is a workflow. If the model itself chooses what to do next at runtime, it is an agent. Anthropic draws the same line in its guide to building effective agents: workflows orchestrate models through predefined paths; agents let the model direct its own process. Most production systems are workflows with a thin agentic layer, and that is usually the right design.

The reason the distinction matters is control. A workflow fails the way ordinary software fails: a missing field, a timeout, an unhandled exception. An agent fails in ways that surprise you. It pursues the wrong goal with confidence. It loops. It hallucinates a tool output and proceeds as if it were real. Understanding that failure surface is the difference between a useful deployment and one that breaks in front of a customer, and it is exactly the gap a team that has shipped agents in production knows how to close, the kind you get when you hire AI engineers who have done it before. I made the broader case for this in my essay on an honest accounting of what agents can do today. The principles of building AI agents follow directly from that failure surface.

Agentic AI vs generative AI: output vs action

Generative AI produces an artifact: a paragraph, an image, a function. You read it, judge it, and decide what to do. Agentic AI takes the action itself. That single shift, from producing output to executing action, is what introduces risk that generation never had.

When a model drafts an email, the worst case is a bad draft you discard. When an agent sends the email, the worst case is a sent email you cannot recall. Generation is reversible by default because nothing happens until a human acts on it. Agentic workflows close that gap, and in closing it, they remove the human checkpoint that quietly caught most errors.

Generation gives you a draft you can throw away. An agent already pressed send. The risk is not in the model - it is in the action you let the model take.

This is why I treat "agentic" and "generative" as different engineering problems, not points on a spectrum. Generative systems need good prompts and good evals on output quality. Agentic systems need all of that plus guardrails on actions, observability on reasoning, and a recovery plan for when the loop goes wrong. The fuller comparison, with the specific failure modes, sits in the supporting piece on agentic AI versus generative AI. For the pillar, hold onto one idea: action carries consequence, and consequence is what you design around.

When agents actually work, and when they don't

Agents earn their keep on tasks with four properties: bounded, reversible, verifiable, and tool-scoped. This is the honest core of the guide, because nobody selling an agent platform will tell you where their product fails.

Bounded means a clear start and stop. "Triage these 200 tickets and flag refund cases for review" is bounded. "Run our support queue" is not. Reversible means a mistake can be undone fast. Drafting is reversible; sending and charging are not. Verifiable means you can check the output mechanically, not by squinting at it. A test suite passes or fails. A schema validates or does not. Tool-scoped means least privilege: the agent holds exactly the tools it needs and no send key, no write access to the primary, nothing whose blast radius you cannot afford.

Remove any one of those and you leave the band. An agent that is bounded, verifiable, and tool-scoped but not reversible is one bad step from a real incident. The math is unforgiving here. Carnegie Mellon's TheAgentCompany benchmark found the best model completed only 24% of realistic multi-step office tasks autonomously, with failure rates near 70% as complexity rose. Production data tells the same story: an analysis of thousands of deployed agents reported a 56.6% success rate across millions of real runs.

The reason is compounding. Reliability multiplies down a chain. A 1% error per step becomes a 63% chance of at least one failure across 100 steps. Wire five agents together at 95% reliability each and your end-to-end success falls to roughly 77%, as the multi-agent reliability math shows plainly. Longer chains are not more capable. They are more fragile.

So the practical rule: keep the autonomous span short, the actions reversible, and the checks mechanical. The tasks I have watched succeed are first-pass triage, structured data extraction with schema validation, draft generation a human edits before sending, and advisory monitoring that reports an anomaly rather than fixing it. The tasks that fail are long-horizon research, ambiguous planning, anything negotiating with a real human counterparty, and anything irreversible in money or law. The supporting pieces on agentic AI use cases and shipping agentic AI examples go deeper, but the test above will save you most of the pain.

A founder I advised, Maya, wanted an agent to "run onboarding" for her SaaS: read each new signup's setup, configure their workspace, email them, and book a call. It demoed beautifully and broke in week two, when it auto-configured a customer's billing tier from a misread field and emailed them a wrong quote. We cut it down to one bounded, reversible job: draft the onboarding email for a human to send. That version has run for months without an incident. The whole task was an agent; the part that earned its keep was a sliver of it.

Before committing engineering time, I run a candidate task through five questions. Can I write the stop condition in one sentence? If a step goes wrong, can a person undo it in under five minutes? Can I check the output with code rather than a careful read? Does the agent need only tools whose worst case I can absorb? And is there a single decision in here I would never delegate, that I can carve out for a human? A task that clears all five is in the band. A task that fails two or more is a workflow, a single model call, or a project for later, not an agent.

Agentic design patterns that hold in production

A handful of patterns survive contact with production. They share one trait: each bounds the model's autonomy at the point where autonomy is dangerous.

Tool use with structured schemas. The model selects a tool, you validate the parameters, you execute, you feed the result back. The schema is the guardrail. A malformed call is caught before it touches anything.
Plan-then-execute with checkpoints. The model drafts a plan, a check (sometimes a human, sometimes a rule) approves it, then execution proceeds step by step. The plan is reviewable before any action lands.
Reflection tied to an eval, not a vibe. The model critiques its own output against a concrete check before proceeding. Reflection only helps when the critic has a real rubric. Self-review against "does this seem good?" mostly launders confidence.
Human-in-the-loop at named decisions. Not a human watching everything, which scales terribly. A human at the specific decisions you decided in advance you will never delegate.

That last one is the discipline most teams skip. Putting a human in the loop everywhere turns the reviewer into a bottleneck, then a rubber stamp, then a liability. I argue the full version of that trap in my book Human in the Loop Is Not a Plan. The fix is to name the decisions that always require judgment, and let the agent own only what falls outside that list. The supporting catalog of agentic design patterns carries every variant; the patterns above are the load-bearing ones.

Two anti-patterns deserve a warning, because they look sophisticated and fail expensively. The first is the open-ended multi-agent swarm: a planner spawning sub-agents that spawn more sub-agents. It demos beautifully and collapses under the compounding math above, because every hop multiplies another reliability discount onto the result. The second is reflection without a ground truth. An agent asked to grade its own work against no external check tends to approve itself, since the same model that produced the error rarely catches it. Reflection earns its place only when the critic holds a test, a schema, or a rule the generator does not control.

The patterns that hold share a shape worth naming: they make the model's choices observable and reversible at the moment of action. Tool schemas catch a bad call before execution. Plan checkpoints catch a bad plan before any step lands. Named human decisions catch the consequences no system should own. You are not trying to make the agent never err. You are trying to make sure that when it errs, the error is cheap, visible, and recoverable.

How to build an AI agent: spec, tools, guardrails, evals

The build loop that holds has four stages in order: write the spec, give it tools, wrap it in guardrails, then evaluate before you widen its scope. Skipping any stage is how a demo dies in production.

Spec first. Write down the goal, the allowed actions, the stop condition, and the decisions the agent may never make alone. The spec is the program here, a point I argue in my AI-Native thesis that the machine does the job and the human evaluates. If you cannot specify the stop condition, you are not ready to build.

Tools, scoped tight. Give the agent exactly what the spec requires. Read replica, not primary. Draft endpoint, not send. Every tool you add expands the blast radius, so add deliberately.

Guardrails on every action. Validate tool inputs against schemas. Cap the loop length. Add a kill switch. Check for prompt injection on anything the agent reads from the outside, because untrusted content in an agent's context is an attack surface, not just data.

Evals before scope. Build the eval harness before you extend autonomy, not after the first incident. Then a run produces a trace you can grade.

# A graded agent run, scored on the trajectory, not just the answer

RUN ticket-triage agent=v3 model=mid-tier steps=6

step 1 tool=search_kb ok latency=420ms

step 2 tool=classify ok conf=0.91

step 3 tool=check_refund ok matched=true

step 4 action=flag_for_human ok # correct: never auto-decide refunds

step 5 tool=draft_reply ok latency=1180ms

step 6 action=stop ok

EVAL trajectory="pass" cost=$0.011 p95_latency=2.3s human_escalation=correct

Notice what the eval grades: the path, the cost, the latency, and whether the agent escalated the one decision it was told never to make alone. A final answer that looks right with a broken trajectory is a regression waiting to happen. The supporting walkthroughs on how to build AI agents and agentic coding go line by line; this is the shape of the loop.

Frameworks and tooling: a neutral survey

The framework you pick matters less than the discipline you bring. That said, the 2026 landscape has settled into a few credible choices, each with a different center of gravity.

LangGraph leans into graph-based control flow, persistence, and audit trails. It maps cleanly to production needs like rollback points, which is why it leads on enterprise adoption per the open-source framework surveys.
CrewAI centers multi-agent collaboration and has broad protocol support. Reasonable when several specialized agents genuinely need to coordinate, though remember the compounding-reliability tax on every added hop.
OpenAI Agents SDK optimizes for the fastest path to something running and works across many models. Good for a first prototype.

My honest take: most teams reach for a multi-agent framework when a single well-scoped agent, or even a plain workflow, would be more reliable and far cheaper. Gartner projects 40% of enterprise applications will feature task-specific agents by the end of 2026, up from under 5% in 2025. A lot of those will be frameworks solving problems the team did not have. Pick the simplest tool that meets the spec, and add complexity only when an eval shows you must. The supporting agentic AI frameworks comparison weighs each on what it costs you in practice, and the rundown of the best AI agents shipping today shows which choices actually hold.

Memory, retrieval, and context for agents

Memory is the part teams treat as an afterthought and then get burned by. An agent that runs across sessions needs an explicit memory architecture, or it forgets what it cannot fit in context and invents what it cannot recall.

Split memory into three kinds. Working memory is the live context window, handled by prompt design. Episodic memory is the structured log of past runs: what happened, when, with what result. Semantic memory is durable facts and preferences that should persist. Working memory is free. The other two are infrastructure decisions about where data lives, how it is indexed, and how stale entries get retired. I break down each layer in the supporting guide to memory systems for agents, and my book Memory Systems for Agents is the most rigorous treatment of this I have found.

Retrieval is where agentic systems meet RAG, and where they inherit RAG's failure modes. The demo retrieves perfectly. Then the corpus grows, the queries drift, and recall quietly collapses over the following months. Agentic RAG, where the agent decides what to fetch, can beat static retrieval on hard queries, but it adds latency and another failure surface. Use it when the query genuinely needs iterative search, not by default. If your knowledge layer is the bottleneck, that is exactly the work Devlyn does on RAG and knowledge integration.

Evaluating agents: a human reviews it is not a plan

You evaluate an agentic workflow by grading its trajectory against an eval harness, not by reading the final answer and nodding. Multi-step, tool-using systems fail in the middle, where a quick glance never looks. The harness is the only thing that scales with autonomy.

A real agent eval checks several layers. Did each tool call succeed with valid parameters? Did the agent escalate the decisions it was required to escalate? Did it stay inside cost and latency budgets? Did the trajectory match an acceptable path, even when the final output looked fine? This is the same discipline I argue for in building evals that predict production, applied to a moving target.

Vibes are not evals. If you cannot write a check for what the agent did, you are not evaluating it. You are hoping.

The reliability numbers above exist because most teams ship without this harness and discover the failure modes from customers. A multi-dimensional study of enterprise agentic systems measured a 37% gap between lab benchmark scores and real deployment performance, and a 50x cost spread across agents hitting similar accuracy. An eval harness is how you find out which side of that gap you are on before you scale. If you would rather have observability and evals built in from day one than bolted on after an incident, that is the work Devlyn does on AI observability and monitoring.

What agents cost: the latency budget and the revenue lens

Every agentic workflow trades latency and cost for autonomy. Anthropic says this plainly in its agents guide, and it is the trade most teams price wrong. A loop that calls a model six times costs roughly six times a single call, plus the latency stacks. Each step is a chance to be slow and a chance to be wrong.

From the revenue seat, this changes the calculus. An agent that resolves a ticket in 40 seconds and $0.04 may beat a human on cost. The same agent that takes four minutes and $0.40 because it looped on a hard case may lose money and a customer. The right question is never "can an agent do this?" It is "can an agent do this inside the latency and cost budget the business can afford?"

This is also why the biggest model is rarely the right one. Revenue rewards the model you can afford to run, ship, and explain, not the one that tops a benchmark. Route the easy steps to a cheap model and reserve the expensive one for the steps that need it. If you want a team that ships agentic workflows with this cost discipline built in, that is what Devlyn's engineers do, and it is the core argument of my book Agents That Actually Work.

Frequently asked questions

What is an AI agent?

An AI agent is software that pursues a goal by taking actions in a loop, using a language model to decide each next step from what it observes. It calls tools, reads results, adjusts, and repeats until a stop condition. The defining trait is that the model chooses its actions at runtime, rather than following a fixed path a developer wrote.

What is the difference between agentic AI and generative AI?

Generative AI produces an artifact you read and judge, like text or an image. Agentic AI takes the action itself, like sending the email or updating the record. The shift from producing output to executing action is what introduces risk: generation is reversible because nothing happens until a human acts, while agentic workflows remove that checkpoint.

Are AI agents reliable?

In a narrow band, yes. Broadly, not yet. Reliability compounds downward, so a small per-step error rate becomes a large end-to-end failure rate over many steps. Production data shows real-world success rates around 56% across diverse agents, well below demo performance. Agents are reliable on bounded, reversible, verifiable, tool-scoped tasks and unreliable outside that band.

When should you not use an AI agent?

Avoid an agent when the task is unbounded, the actions are irreversible, the output cannot be checked mechanically, or the tools are too broad to contain. Skip it too when a single model call or a fixed workflow would do the job more cheaply and reliably. Irreversible financial or legal actions should never run autonomously.

How do you evaluate an AI agent?

Grade the trajectory, not just the final answer, with an eval harness. Check whether each tool call succeeded with valid inputs, whether the agent escalated the decisions it was required to escalate, and whether it stayed inside cost and latency budgets. Build this harness before you widen the agent's autonomy, not after the first incident.

What is the best framework to build agents?

There is no single best framework; pick the simplest one that meets your spec. LangGraph suits complex, auditable control flow; CrewAI suits genuine multi-agent collaboration; the OpenAI Agents SDK suits the fastest prototype. Most teams overreach for multi-agent frameworks when one well-scoped agent, or a plain workflow, would be more reliable and cheaper.

Where this leaves you

Agentic workflows are real and valuable in the narrow band where tasks are bounded, reversible, verifiable, and tool-scoped. Outside it, autonomy compounds risk faster than it adds value. Build the spec, scope the tools, wrap the guardrails, and grade the trajectory before you widen the loop. That sequence is the difference between a system that holds and a demo that embarrasses you.

If you are deciding whether a process should be an agent at all, the deeper version of this argument lives in my book Agents That Actually Work, which goes through the narrow-band framework with production examples. And if you want a team that ships agentic workflows with evals and cost discipline built in from day one, hire AI engineers who have done it in production. The honest path is the faster one. Build for the narrow band, prove it with evals, and extend only when the numbers say you can.

Agentic Design Patterns That Actually Work

Alpesh Nakrani — Tue, 02 Jun 2026 18:30:00 GMT

The agentic design patterns that survive production are the bounded ones: tool-use with guardrails, plan-then-execute with checkpoints, reflection scored by evals, and human-in-the-loop only at named decisions. Everything else is a demo waiting to embarrass you. Between the impressive prototype and the quiet production failure sits a thin set of patterns that hold, and they all share one trait: they constrain the agent more than they free it.

I have spent the last two years putting agents into real operational workflows, where a bad output lands on a real customer, not a slide. This piece sits one level below the pillar field guide on AI agents and agentic workflows, and I wrote the broader case in an honest accounting of what agents can do today. What follows is narrower: the reusable patterns I trust, the failure mode each one fixes, and the cost you pay to run it.

A pattern earns its place by closing a specific failure mode, not by making the architecture diagram look sophisticated.

Key takeaways

If you read nothing else, take these five claims with you:

Tool-use with guardrails is the base pattern. Give the agent exactly the tools it needs, least privilege, and validate every call.
Plan-then-execute with checkpoints beats open-ended autonomy when the steps are knowable. You get predictability and a place to stop.
Reflection only pays when a machine scores it. Self-critique without an eval is the agent grading its own homework.
Human-in-the-loop belongs at named decisions, not "everywhere," or it collapses into rubber-stamping.
Multi-agent autonomy is the pattern to reach for last. It can stretch latency from under a second to tens of seconds and adds a failure surface most teams cannot debug.

Workflows first, agents only when you must

The most useful distinction in agentic design is not a pattern at all. It is the line between a workflow and an agent. A workflow orchestrates LLM calls and tools through code paths you wrote. An agent lets the model decide its own next action at runtime. Anthropic draws this line cleanly in Building Effective Agents, and their advice is blunt: most teams should use composable workflow patterns, and reach for full autonomy only when the task genuinely needs it.

This matters because autonomy is the expensive ingredient. The moment the model picks its own next step, you lose the predictability that makes traditional software debuggable. So the first agentic design pattern is a decision, not code: can I express this as a fixed sequence of steps? If yes, write the workflow. You will ship something that holds at 3am instead of something that demos well at 2pm. If you want a team that has drawn this line in production before, that is the kind of work the engineers you hire to build AI systems do every day.

When it works: any task where the steps are knowable in advance. Classification, extraction, routing, summarize-then-act. The failure mode: teams reach for an autonomous loop because it feels modern, then spend a quarter debugging behavior they could have hard-coded in an afternoon. Concrete example: a "research agent" that fetches three known sources, summarizes each, and merges them is a workflow. Calling it an agent does not make it one, and pretending otherwise just adds tokens.

Tool-use with guardrails: the base pattern

Tool-use is the foundation every other pattern builds on. The agent calls a function, reads the result, and decides what to do next. It appears in nearly every production system because it is how the model touches the world. The pattern that survives is tool-use with guardrails: least-privilege tool scopes, schema validation on every call, and a hard cap on iterations.

The guardrails are the whole point. An agent that summarizes documents should not hold a write key to the document store. An agent that drafts replies should not hold a send key. This is the principle of least privilege applied to a system that will, eventually, do something you did not predict. You are not preventing the surprise. You are shrinking its blast radius.

When it works: the task resolves in a single LLM call plus a few well-defined tools, with each tool's output checkable against a schema. The failure mode: the model hallucinates a tool result, or invents an argument the API never accepts, and proceeds as if the fiction were real. Concrete example: a fintech lead I worked with, Priya, ran a fraud-flagging agent that read a transaction, called a read-only risk API, and wrote a flag to a review queue, never touching the ledger. In its first month it mis-scored roughly 40 transactions out of 12,000, but because every flag landed in a human-reviewed queue and the agent held no write key, not one of those errors moved a cent. The blast radius was a reviewer's extra two minutes, not a refund.

# Guardrail, not vibes: validate before you trust the tool call

if tool_call.name not in ALLOWED_TOOLS:

raise ToolScopeError(tool_call.name)

result = run(tool_call)

if not schema.validate(result):

escalate(reason="tool output failed schema")

Plan-then-execute with checkpoints

When a task has multiple steps with dependencies, plan-then-execute beats open-ended reasoning. The agent produces an explicit plan first, then executes it step by step, pausing at checkpoints where the cost of a wrong step is high. The plan is inspectable before any action runs, which is the property that makes the pattern safe.

The classic open-ended version is ReAct, where the model interleaves reasoning and acting in a loop, introduced in the ReAct paper in 2022. ReAct is powerful and worth knowing. In production I bound it: an explicit iteration limit, a written plan the agent commits to, and checkpoints where execution stops for a check before continuing. Unbounded loops are how agents quietly burn tokens and drift off-goal.

When it works: multi-step tasks where the decomposition is knowable up front, like a data migration or a multi-stage report. The failure mode: without checkpoints, the agent commits step three before anyone notices step two was wrong, and now the error has propagated. Concrete example: a billing-reconciliation agent plans five steps, executes the first two against a staging copy, and stops at a checkpoint before any step that writes to the production ledger. The reversible steps run free. The irreversible one waits.

Reflection, but only with evals

Reflection is the pattern where the agent critiques its own output and revises it. Done right, it reduces hallucinations and improves reasoning depth. Done wrong, it is the agent grading its own homework and giving itself an A. The difference is whether a machine, not the model, scores the work.

My rule is simple. Reflection only counts when the critique is anchored to something external the agent cannot talk its way past: a test suite that runs, a schema that validates, a retrieval check against ground truth, an eval harness that returns a number. A code agent that writes a function, runs the tests, reads the failures, and fixes them is using reflection correctly, because the tests are the judge. A writing agent that "reviews its tone" and declares itself satisfied is using reflection as theater. This is the same discipline I argue for in the guide to evaluating AI agents: a human or a model saying "looks good" is not a measurement.

When it works: any output you can verify mechanically, where a failing check gives the model real signal to revise against. The failure mode: reflection without an external scorer inflates confidence while changing nothing, and a Reflexion loop that runs ten cycles can burn 50x the tokens of a single pass for no measurable gain. Concrete example: an extraction agent pulls fields from an invoice, validates them against a schema and a known total, and re-extracts only the fields that fail. The check decides when it is done, not the model's opinion of itself.

Reflection without an external scorer is the agent grading its own homework. Vibes are not evals, and self-critique is not verification.

Human-in-the-loop, only at named decisions

"Put a human in the loop" feels safe and scales terribly. The reviewer becomes a bottleneck, then a rubber stamp, then a liability. I take the full version of this argument further in my book Human in the Loop Is Not a Plan. The pattern that survives is narrow: a human is in the loop at named decisions, not everywhere.

A named decision is a specific, written checkpoint where the agent must stop and get a human verdict. Anything irreversible. Anything that touches money, a legal commitment, or a customer relationship. Everything else the agent handles, and routes only the named cases to a person with enough context to act fast. "A human reviews everything" produces a reviewer who reads the first line and approves on autopilot. "A human approves any refund over $500" produces oversight that actually happens.

When it works: high-stakes, low-frequency decisions where human judgment adds more than it costs. The failure mode: blanket review collapses into rubber-stamping under load, and the institutional weight of "human-approved" gets attached to outputs no human meaningfully read. Concrete example: a support team I advised, run by a lead named Daniel, let its agent draft and send routine replies on its own, but any message that mentioned a refund, a legal threat, or a churn risk stopped for a named human approval before it went out. The result was a clean split: the agent handled about 2,000 easy tickets a day, and Daniel's team reviewed only the 20 hard ones that actually needed judgment. The oversight that survived was the oversight that fit a human's day.

When NOT to use an agent

The most valuable pattern is knowing when to use no agent at all. If the task is fully specifiable, a workflow or plain code is cheaper, faster, and easier to debug. If the action is irreversible and high-stakes, do not hand it to an autonomous loop. If you cannot write a check for "good output," you cannot evaluate the agent, and an agent you cannot evaluate is a liability you have not measured yet.

Multi-agent systems deserve special skepticism. Orchestrating several specialized agents looks elegant on a whiteboard and behaves badly in production. The hidden economics of agents are unforgiving: a single LLM call that returns in about 800 milliseconds can balloon to 10 to 30 seconds once you wrap it in a multi-agent orchestrator with a reflection loop. The same analysis shows a ten-cycle reflexion loop can burn 50 times the tokens of one linear pass, and an unconstrained coding agent can run $5 to $8 per task.

That cost is a revenue decision, not a footnote. An agent that burns even 50 cents per task and runs a million times a month is a $500k annual line item before it earns a dollar. The narrow band where autonomy earns its keep is defined as much by unit economics as by capability. Reach for the simplest pattern that closes the failure mode, and reach for multi-agent autonomy last, if at all.

Composing the patterns

Real systems are not one pattern. They compose two or three, each fixing a different failure. A production agent I trust looks like this: a bounded tool-use core, wrapped in a plan-then-execute structure with checkpoints, with reflection scored by evals on the verifiable steps, and human-in-the-loop at the two or three decisions that are irreversible. Tool-use gives it reach. Planning gives it predictability. Reflection gives it self-correction with a real judge. The human handles the edges no system should own.

The order matters. Add patterns in response to failure modes you have observed, not requirements you imagine. Every pattern you add costs tokens, latency, and debugging surface. The discipline is to add only what closes a real gap, then stop. This is the same loop I describe in the AI-Native thesis: the machine does the work, and the human's job narrows to judgment, specifying what the agent may do and evaluating what it did. The patterns are how you make that judgment enforceable in code. The deeper version of this framework, with production examples and the failure modes behind each pattern, is in my book Agents That Actually Work.

Frequently asked questions

What are the main agentic design patterns?

The patterns that hold in production are tool-use with guardrails, plan-then-execute with checkpoints, reflection scored by evals, and human-in-the-loop at named decisions. Multi-agent orchestration is a fifth, used last because it multiplies cost and latency. Most reliable agents compose two or three of these, not all five.

How do I build a reliable AI agent?

Start with the simplest pattern that closes your failure mode, usually bounded tool-use. Give the agent least-privilege tools, validate every call against a schema, cap its iterations, and add an external eval before you trust its self-critique. Add planning and human checkpoints only where an observed failure justifies them. The step-by-step version lives in the guide to how to build AI agents.

When should I not use an agent?

When the task is fully specifiable, write a workflow or plain code instead, because it is cheaper, faster, and debuggable. Avoid autonomous agents for irreversible, high-stakes actions, and avoid any agent whose output you cannot check mechanically. If you cannot write an eval for it, you cannot trust it in production.

What do agentic workflows actually cost to run?

More than the demo suggests. A single call that returns in under a second can stretch to tens of seconds once it is wrapped in a multi-agent orchestrator with a reflection loop, and a long reflexion loop can burn many times the tokens of one pass. Treat cost and latency as design constraints from day one, not problems to fix after launch.

If you are building agents and want a team that ships them with guardrails and evals from day one, that is the work Devlyn does on AI observability and monitoring. The patterns above are the starting point. The discipline to apply only the ones a real failure mode demands is what separates an agent that holds from one that quietly breaks.

Agentic AI vs Generative AI: What's Actually Different

Alpesh Nakrani — Mon, 01 Jun 2026 18:30:00 GMT

Here is the difference, stated plainly. Generative AI produces content in response to a prompt: text, code, an image, a summary. Agentic AI plans and takes actions toward a goal across multiple steps, calling tools and deciding what to do next based on what it sees. The leap from generating to acting is where most of the new risk lives, because actions have consequences a paragraph of text never does.

I have spent two years putting both kinds of systems into production, where failures land on real people and real revenue. The agentic AI vs generative AI question gets answered badly almost everywhere, usually by a vendor who sells one of them. So let me answer it from the seat where engineering meets the P&L, and name the trade-offs nobody else will. This piece sits under my longer field guide to AI agents and agentic workflows, which is the place to go once you have settled the difference and want the build details.

Generation produces output you read. Agency produces actions the world reacts to. That single shift is the whole story.

Key takeaways

If you read nothing else, hold these five claims:

Output vs. action. Generative AI returns a result; agentic AI changes state in systems by calling tools across many steps.
Errors compound. A single generation is one shot to get right; an agent multiplies its error rate across every step, so reliability drops fast as tasks get longer.
New failure modes. Agents add irreversible actions, looping, hallucinated tool results, and prompt injection - risks a chatbot does not have.
Agents win in a narrow band. Bounded, reversible, verifiable, tool-scoped tasks. Outside it, plain generation plus a human is usually safer and cheaper.
Trust requires evals and guardrails. "A human reviews it" is not a plan. You need least-privilege tools, an eval harness, and human checkpoints at named decisions.

The precise difference: output versus action

Generative AI is a function. You give it a prompt, it returns content, and it stops. The model that drafts your email, writes a SQL query, or summarizes a document is generative. It is powerful and, on a single call, cheap to reason about. The worst it can do is hand you a wrong answer, which you then read and decide what to do with. The human is still the actor.

Agentic AI is a loop. You give it a goal and a set of tools, and it decides its own next action, observes the result, updates its plan, and acts again. It might query a database, send an email, write a file, or call an API, running for minutes or hours, until it judges the goal met. The model is no longer just producing text. It is taking actions in your systems, and some of those actions change the world.

That is the entire distinction. Generative AI answers what should I say? Agentic AI answers what should I do next, and then do it. The Neo4j and Descope teams frame the same line: generative systems refine how something is said; agentic systems decide what gets done and then do it. The capability is real. The risk surface is new.

One business consequence falls straight out of this. A wrong generation costs you a re-prompt. A wrong action can cost you a refund issued in error, a customer emailed the wrong terms, or a production record overwritten. Generation failures are private. Agentic failures are public, and they have a price.

The new failure modes agents introduce

Generative AI has one well-known failure mode: it hallucinates, and you catch it because you read the output. Agentic AI inherits that and adds four more that generation simply does not have. This is the part the explainers skip.

Errors compound across steps. A single generation gets one shot, but an agent chains many, and small errors cascade. The arithmetic is unforgiving: at 95% reliability per step, a 10-step task succeeds about 60% of the time, and a 20-step task barely 36%.

The benchmarks bear this out. On MLAgentBench, the best frontier model landed near a 37.5% average success rate across machine-learning experimentation tasks; on AutoPenBench, a fully autonomous penetration-testing agent cleared only about 21% of tasks. METR's time-horizon work puts a recent frontier model's "50% reliability horizon" at roughly 50 minutes of expert work, with that horizon doubling about every seven months. Long-horizon autonomy is still the frontier, not the floor.

At 95% reliability per step, a 20-step agent succeeds barely a third of the time. Compounding is the math that demos hide.

Irreversible actions. Generation is always reversible, because you just discard the draft. An action may not be: sending is not undoable, and a billing write is not undoable. My rule is simple. If an action cannot be reversed in under five minutes by a senior engineer with no external dependencies, it does not belong in the agent's autonomous tool scope.

Hallucinated tool results and loops. Agents confidently proceed on tool output that was never real, or get stuck repeating a step. A recent framework on failure modes in generative and agentic systems maps how vulnerabilities propagate across layers, from the model up through the agentic reasoning loop. The pattern I see in production is the same: agentic failures are rarely about bad prose. They come from the system losing the right context, constraints, and history as work unfolds across tools and time.

Prompt injection through tools. This is the security failure that scares me most. Once a model can execute tools, a malicious instruction hidden in a web page, an email, or an API response can hijack the goal. OWASP ranks prompt injection as LLM01, the top risk for LLM applications, two editions running. With a generator a poisoned input yields bad text; with an agent wired to your systems it yields bad actions, and the blast radius is the difference.

Where agents genuinely beat generation

Agents earn their keep in a narrow but real band: tasks that are bounded, reversible, verifiable, and tool-scoped. Inside that band, an agent does work a single generation cannot, because the work requires acting on what it finds, not just producing one answer. Outside it, generation plus a human is usually safer, faster, and cheaper.

Concretely, agents beat plain generation when the task needs real tool calls in sequence: triaging 200 support tickets against a known taxonomy and drafting a prioritized queue; extracting data from messy PDFs and validating each record against a schema; running a test suite, reading the failure, and fixing the code until it passes. The verification step is what makes these safe - the test either passes or it does not. Vibes are not evals.

Generation wins when the job is a single creative or analytical output a person will use directly: a first draft, an explanation, a summary, a code snippet you will review anyway. Reaching for an agent here adds latency, cost, and a control surface you do not need. I have watched teams wrap a five-step agent around a task one prompt solved. They paid more and shipped slower for the privilege of a worse failure mode.

A fintech team I advised wanted an agent to "handle dispute intake." The version that shipped did one bounded thing: read the dispute, classify it against their taxonomy, and draft a response for a human to send. It cleared roughly 200 intakes a day at about four cents each, and a person still pressed send on every reply. Their first design had let the agent issue provisional credits directly. In testing it credited a duplicate dispute twice off a misread field, and that single irreversible action is why the autonomous-credit tool never made it to production. The generative half was safe to ship; the agentic half had to be fenced.

The honest test is the constraint, not the hype. Match the tool to what would actually break. If the task is one output a human evaluates, generate. If it genuinely requires multi-step action with checkable results at each stop, an agent might earn its place. My honest accounting of what agents can do today walks the full band with production examples, and the book Agents That Actually Work goes deep on the bounded-reversible-verifiable-tool-scoped framework.

What you need before trusting an agent

Before you give an agent autonomy, you need three things generation never demanded: least-privilege tools, an eval harness for multi-step behavior, and human checkpoints at named decisions. "A human reviews it" is not a plan, because the reviewer becomes a bottleneck, then a rubber stamp, then a liability.

Least-privilege tools. Give the agent exactly the tools the task needs and nothing more. A summarizer gets no write key. An analytics agent reads from a replica, not the primary. This is the OWASP-recommended defense and the cheapest insurance you will buy: it shrinks the blast radius when, not if, the agent acts unexpectedly.

Evals that predict production. Test the trajectory, not just the final answer. A recent survey on evaluating LLM agents lays out why agent evaluation is its own discipline: the interactions are dynamic and long-horizon, so a single final-answer score hides where the run actually broke. You need traces of every step, success criteria you can check mechanically, and adversarial cases including injection attempts. I walk through building that harness in my guide to evaluating AI agents.

Human-in-the-loop at named decisions. Decide in advance which actions an agent may never take alone, such as anything irreversible, financial, or relationship-bearing, and route those to a person with enough context to act fast. Name the decisions, not a vague "review everything."

This is the AI-Native pattern, not the AI-assisted one. The machine does the work; the human's job contracts to judgment. That principle is the through-line of my AI-Native thesis, and it is exactly what separates an agent you can trust from a demo you cannot.

Agentic AI vs generative AI: comparison table

Dimension	Generative AI	Agentic AI
Core behavior	Produces content from a prompt	Plans and takes actions toward a goal
Execution	Single request, then stops	Multi-step loop: plan, act, observe, repeat
Tools	None required	Calls tools, APIs, databases, files
Effect on the world	Returns output you read	Changes state in real systems
Main failure mode	Hallucinated content	Compounding errors, irreversible or hijacked actions
Reliability profile	One shot to get right	Error rate multiplies across steps
What you need to trust it	Read and judge the output	Evals, guardrails, least-privilege tools, HITL
Best fit	One output a human will use directly	Bounded, reversible, verifiable, tool-scoped tasks

Frequently asked questions

What is the difference between agentic AI and generative AI?

Generative AI produces content from a prompt and then stops; agentic AI plans and takes actions toward a goal across multiple steps, calling tools and deciding its own next move. The practical difference is consequence: generation returns text you evaluate, while an agent changes state in real systems, so its mistakes carry costs generation does not.

What is agentic AI in simple terms?

Agentic AI is a system where a language model is given a goal and a set of tools, then loops, deciding what to do, doing it, checking the result, and adjusting until it judges the goal met. Unlike a chatbot, an AI agent can send an email, update a record, or run code on its own. That autonomy is the value and the risk.

Are AI agents reliable enough to trust with real work?

Only inside a narrow band, and only with guardrails. Agents are reliable on tasks that are bounded, reversible, verifiable, and tool-scoped, paired with evals and human checkpoints at named decisions. They are not yet reliable for long-horizon, ambiguous, or irreversible work, because errors compound across steps and current benchmarks still show high failure rates on long tasks.

When should I use generative AI instead of an agent?

Use generative AI when the job is a single output a person will use directly - a draft, a summary, an explanation, a code snippet you will review. Reach for an agent only when the task genuinely requires multi-step action with checkable results at each stop. Wrapping an agent around a one-prompt task adds cost, latency, and a worse failure mode for no benefit.

If you are deciding where agents fit in a real workflow, and where plain generation plus a human is the better bet, that is exactly the build problem my team takes on. You can hire AI engineers who ship agentic systems with evals and guardrails from day one, and wire in AI observability and monitoring so you see the trajectory, not just the final answer. The goal is never an agent for its own sake. It is the narrowest system that does the job and holds at 3am.

LLM Evaluation Metrics That Matter (and the Ones That Lie)

Alpesh Nakrani — Sun, 31 May 2026 18:30:00 GMT

The LLM evaluation metrics that matter measure what breaks in production: task accuracy on a frozen, production-sampled set; human-disagreement rate; faithfulness; latency at p95; and cost per resolved task. The ones that lie are aggregate accuracy on a set you keep editing and vanity benchmark scores like MMLU. The first group tells you whether to ship. The second group tells you whether you feel good. Those are different questions.

I have watched a team celebrate 94% accuracy on a Friday and roll back the model on Monday. Nothing about the model changed over the weekend. The metric was always lying; the weekend just gave production enough time to prove it. This piece is about how to tell the two kinds of LLM metrics apart before a customer does it for you. It is the metric chapter of my complete guide to LLM evaluation; for the harness these metrics plug into, see how to build an LLM evaluation framework.

A metric that cannot go down is not measuring your model. It is measuring your willingness to edit the test.

Key takeaways

If you read nothing else, these are the load-bearing claims:

Aggregate accuracy on an elastic eval set is the most common lie in AI. If the set grows whenever the number dips, the number is about the test, not the model.
Faithfulness and human-disagreement rate predict production failures that accuracy hides. Fluent and wrong still scores as a pass on a loose rubric.
Vanity benchmarks like MMLU are saturated. Frontier models cluster at 88-93%, so the number cannot rank them on your task.
Cost per resolved task is the one metric the business should see. It ties evaluation directly to the P&L.

Why most LLM metrics lie: the elastic ruler

The single biggest reason an LLM evaluation metric lies is that the set under it keeps changing. A developer adds easy cases when the score dips. A PM quietly drops the case that always fails. The number climbs, and everyone reads it as the model improving. It is not. The ruler got shorter.

I cover the mechanics of freezing and versioning a set in my essay on evals that predict production. The short version: sample your eval set from real production traffic, freeze it as a named artifact, and never let it grow organically. A frozen set can only score lower over time, which is exactly what makes it honest. You want a fixed ruler, not a rubber band.

This is also why benchmark scores belong in the "lies" column for production decisions. MMLU and HumanEval are saturated: GPT-5, Claude Opus, and Gemini all cluster in the high 80s and low 90s, so the score range has compressed until noise exceeds signal (benchmarkingagents.com). Worse, popular benchmark questions leak into training data, so a model can recall the answer instead of reasoning to it. A high MMLU score tells you the model has seen the test. It tells you nothing about your traffic.

Reference-based vs reference-free metrics: where each one lies

Most LLM evaluation metrics split into two families, and knowing which family you are holding tells you most of where it can lie. Reference-based metrics compare the output to a known correct answer. Reference-free metrics judge the output on its own, with no answer key. You need both, and you mislead yourself when you reach for the wrong one.

Reference-based metrics include exact match, BLEU, ROUGE, and BERTScore. BLEU and ROUGE measure surface word overlap; BERTScore measures embedding similarity. They are fast, cheap, and deterministic, which makes them tempting. They also miss a correct answer phrased differently, and they reward a wrong answer that happens to share words with the reference. Use them only where there is a tight expected output, like translation or extraction, and never as the gate for an open-ended chatbot. A high BLEU score on a support reply tells you the model copied the template, not that it solved the ticket.

Reference-free metrics include faithfulness, answer relevancy, and most LLM-as-a-judge scores. They assess the output directly against the question or the source context, so they handle open-ended tasks where no single answer exists. That flexibility is also their weakness: they inherit the blind spots of whatever judge model grades them. A reference-free score is only as honest as the model behind it, which is why the judge needs its own calibration before you trust it. The practical rule: reference-based for closed tasks with a real answer key, reference-free for the open-ended traffic that makes up most production systems.

The metrics that matter, one at a time

Each metric below earns its place because it catches a failure mode the others miss. For each, here is what it measures, when it lies, and how to read it.

Task accuracy on a frozen, production-sampled set. This measures whether the model gets your real cases right, scored against a reference, on a set that does not move. It lies the moment you let the set drift or score it with a loose rubric that accepts "close enough." Read it as a trend, not a snapshot: the same questions, the same rubric, this model versus the last one. A single accuracy number with no frozen denominator behind it is a vanity figure wearing a lab coat.

Human-disagreement rate. This is the fraction of cases where the model's output diverges from a calibrated human reference beyond a tolerance you set in advance. It is the metric I report to leadership, because it is anchored to human performance and to a fixed distribution. It lies if your raters are not blinded to the model version, because reviewers give quiet benefit of the doubt to a model they helped tune. Read it directionally: up means worse, down means better, and you can open the exact cases that moved.

Faithfulness (groundedness). Faithfulness measures whether every claim in the answer can be inferred from the provided context. RAGAS defines it concretely as the number of claims supported by the context divided by the total claims in the answer (Ragas docs). It catches the failure accuracy cannot see: a fluent, confident answer that contradicts the source. It lies when you score it with a cheap judge model that under-detects contradiction; in 2026, faithfulness scoring is only reliable with a strong reference model behind it. Read a low faithfulness score as a hallucination alarm, not a style note, and pair it with the discipline in my piece on measuring and reducing hallucination.

Latency at p95. Average latency is a comfort metric. The 95th percentile is the truth, because your slowest 5% of requests are where users rage-quit and where timeouts cascade. It lies only when you report the mean instead. Read p95 as a hard product constraint: a model that is 2% more accurate and 600ms slower at p95 may lose you more revenue than it earns.

Cost per resolved task. More on this below, because it is the metric that connects the whole harness to the business.

A comparison you can paste into a deck

Here is the same set of LLM evaluation metrics in one table: what each one measures, and whether it tells the truth about production.

Metric	What it measures	When it lies	Verdict
Task accuracy (frozen set)	Correct answers on real, version-locked cases	When the set drifts or the rubric goes loose	Matters
Human-disagreement rate	Model vs. calibrated human on a fixed set	When raters are not blinded to model version	Matters
Faithfulness	Claims supported by the provided context	When scored by a weak judge model	Matters
Latency p95	Worst-case response time users actually feel	When you report the mean instead	Matters
Cost per resolved task	Spend divided by tasks fully handled	When you count attempts, not resolutions	Matters
Aggregate accuracy (elastic set)	Average correctness on a set you keep editing	Whenever the set changes to chase the number	Lies
Benchmark score (MMLU, etc.)	Performance on a public, saturated test	Contamination and saturation; not your traffic	Lies

How to read AI evaluation metrics together, not alone

No single metric gates a deploy. A model can ace task accuracy and still fail faithfulness, which means it is confidently wrong on the cases where it diverges. A model can win on faithfulness and lose on p95, which means it is honest and too slow to use. The metrics are a panel, and the panel disagreeing is itself a signal.

Here is what that panel looks like coming out of a real eval runner, scored against a frozen set. The numbers are realistic, not from a specific live system.

# eval run against frozen set eval-set-2026-w24-v1.jsonl

python -m eval.runner \

--suite eval-set-2026-w24-v1.jsonl \

--model prod-candidate-2026-06-15

# metrics summary

task accuracy 0.883 # frozen set, up 0.006 vs prior

human disagree 6.8% # threshold 8.0%, PASS

faithfulness 0.91 # threshold 0.90, PASS

latency p95 1,920 ms # +180 ms vs baseline, FLAG

cost / resolved $0.041 # up from $0.034, FLAG

verdict GATE CLEAR # pending p95 + cost review

Notice that two metrics flag without blocking. The point of a flag is to make a trade-off visible instead of letting it ship silently. Accuracy went up, and so did latency and cost. Whether that trade is worth it is a business decision, and the harness puts the numbers in front of the people who should make it.

The one metric the business should see: cost per resolved task

Cost per resolved task is total inference spend divided by the number of tasks the system fully handled without a human finishing the job. It is the metric I put in front of revenue and finance, because it converts every engineering choice into money.

It lies in exactly one way, and the way is common: counting attempts instead of resolutions. A cheaper model that resolves 70% of tickets is not cheaper than a pricier model that resolves 90%, once you price in the human who cleans up the other 30%. The token cost per call dropped. The cost per resolved task went up. Teams optimize the first number and wonder why the support line got more expensive.

Token cost is what the model charges you. Cost per resolved task is what the model costs you. Only one of them is on the P&L.

This is also where evaluation stops being an engineering hobby and becomes a revenue lever. When you can say "this model resolves 8% more tasks at $0.007 less per resolution," you have turned an eval run into a margin argument. That is the sentence that gets an AI project funded, and the sentence most teams cannot say because they never measured resolution, only accuracy. Tracking cost per resolved task in production is squarely an AI observability and monitoring problem, not a one-off spreadsheet.

One honest trade-off: cost per resolved task is harder to compute than token cost. You need a reliable definition of "resolved," which often means instrumenting downstream outcomes and accepting some noise in attribution. It is worth the trouble. A metric that is approximately right about money beats one that is precisely right about tokens.

Where these metrics still fall short

Even the metrics that matter have a ceiling. A frozen set drifts from reality as the world changes, so it needs a scheduled refresh, versioned the way code releases are versioned. Faithfulness scoring inherits the blind spots of whatever judge model grades it. And cost per resolved task depends on a "resolved" definition that a product team has to own and defend.

None of this argues for a human reviewing every output instead. That does not scale, and I make the full case in why a human in the loop is not a plan. The answer is not more review. It is a metric panel that earns the right to gate a deploy, plus a human who designs and audits that panel. The machine does the work. The human evaluates the work, and the metrics are how the evaluation scales.

Frequently asked questions

What LLM evaluation metrics actually matter? Five: task accuracy on a frozen production-sampled set, human-disagreement rate, faithfulness, latency at p95, and cost per resolved task. They matter because each catches a production failure the others miss, and none of them improves just because you edited the test.

How do I measure LLM performance without lying to myself? Freeze your eval set, sample it from real traffic, blind your raters to the model version, and set pass thresholds before the run, not after. Read every metric as a trend on a fixed ruler, and never let the set grow to rescue a number.

Are benchmark scores like MMLU useful AI evaluation metrics? For ranking frontier models on your task, no. MMLU and HumanEval are saturated and contaminated, so scores cluster in the high 80s and reflect memorization more than reasoning. Use them for rough capability filtering, never as a deploy gate.

What is the difference between reference-based and reference-free metrics? Reference-based metrics (exact match, BLEU, ROUGE, BERTScore) compare the output to a known answer key, so they fit closed tasks like translation or extraction. Reference-free metrics (faithfulness, answer relevancy, LLM-as-a-judge) score the output directly with no answer key, so they fit the open-ended traffic most production systems handle. Use reference-based where a tight expected answer exists, reference-free everywhere else.

What is the single most important LLM metric for the business? Cost per resolved task. It divides total inference spend by tasks fully handled, so it captures quality and price in one number and puts evaluation directly on the P&L.

If you want the full harness these metrics plug into, including label-blinding protocols and how to handle disagreement, my book A Field Guide to Evals walks through it end to end. And if you would rather have a team instrument cost per resolved task and the rest of this panel in your stack from day one, that is exactly what Devlyn's AI observability and monitoring work is for. Measure what breaks. Ignore what flatters.

Evals that predict production, not vanity

Alpesh Nakrani — Sat, 30 May 2026 18:30:00 GMT

Every quarter I talk to engineering teams that are genuinely proud of their eval suite. Accuracy above ninety percent, a green CI badge, clean dashboards. Then they ship and something breaks in a way no metric predicted. The model hallucinates a product SKU. It trips over an edge-case utterance that a real customer submitted in week two of the pilot. The team goes back, adds a test, and calls it a lesson learned.

That is not an eval problem. That is a sampling problem dressed up as a measurement problem, and the distinction matters enormously when you are operating at the speed Devlyn operates, AI-Native from the ground up, where the engineering team does not have the luxury of a separate QA org to catch what the model misses. Our senior engineers own production readiness end-to-end. That means the eval harness has to do work that would otherwise fall to a QA team that does not exist.

This essay is about the harness I actually trust, not the one that looks good in a demo, but the one that has earned the right to gate a production deploy.

The sampling problem no one talks about

Most eval suites are built bottom-up: a developer writes cases while they are building the feature, a PM adds a few edge cases during review, and the set accumulates. The result is a distribution that reflects what the team imagined users would do, not what users actually do. Those two distributions can diverge badly.

The fix is mechanical but requires discipline: sample your eval set from real production traffic, freeze it, version it, and never let it grow organically again. At Devlyn we run a weekly job that samples a stratified slice of production requests, stratified by intent cluster, by confidence score, and deliberately over-weighted toward sessions where the model's confidence was low or where a human reviewer flagged a correction. That slice gets frozen as a named artifact: eval-set-2026-w23-v1.jsonl. It does not change. If we want to add new cases, we cut a new version.

Freezing the set sounds obvious until you realize what it implies: your eval score on an older frozen set can only go down. There is no sneaking in new easy cases to bring the number back up. That is the point. You want a fixed ruler, not an elastic one.

A moving eval set is not a ruler. It is a rubber band. The number it reports is a fact about the test, not the model.

The versioned artifact also gives you something you rarely get in AI projects: a straight historical comparison. You can ask whether the model you are about to deploy is better or worse than the one you deployed six months ago, on exactly the same questions, scored by exactly the same rubric. That is a sentence most teams cannot say with confidence.

Blind your labels before you score anything

When human raters score model outputs, they need to be blind to which model version produced each response. This sounds like an academic concern until you watch an engineer give a subtle benefit-of-the-doubt to outputs from a model they helped tune. It happens. It is not dishonesty; it is just human cognition.

Our rating pipeline strips the model version, shuffles the output order, and inserts a consistent proportion of gold-standard human responses into the batch without labeling them as such. Raters do not know whether they are scoring a model output or a reference human response. This matters for two reasons.

First, it catches rater drift. If your raters start scoring the planted human responses below the threshold that would pass a model, your rubric has broken down, either the raters have gotten sloppy or the rubric is no longer calibrated to what good actually looks like. That is a signal to stop and recalibrate before you score anything else.

Second, it gives you a concrete ceiling. Human-to-human agreement on the same task, scored by your rubric, is the ceiling your model will ever reach. If inter-rater agreement among humans is eighty-two percent on your hardest intent cluster, a model that hits eighty-five percent on that cluster is probably gaming the rubric, not genuinely exceeding human ability. Worth investigating rather than celebrating.

The book A Field Guide to Evals covers label-blinding protocols in detail, including how to handle cases where the model output and the human reference are both correct but stylistically different, which is where most rubrics quietly collapse.

Inter-rater disagreement is a rubric signal, not noise

When two experienced human raters disagree on a case, the instinct is to average their scores or escalate to a tiebreaker. Both responses treat the disagreement as an inconvenience to resolve. I treat it as data.

A cluster of disagreement on a particular intent type means one of three things: the rubric is ambiguous for that intent, the correct answer is genuinely context-dependent in ways the rubric does not capture, or the task is hard enough that reasonable people disagree. All three are useful to know before you deploy a model on that task. None of them should be smoothed over.

We track inter-rater agreement by intent cluster and over time. When agreement on a cluster drops below a threshold, we use seventy-five percent as our floor, we pause scoring on that cluster and run a rubric review. Sometimes this takes an afternoon. Once it took two weeks because the disagreement exposed a genuine product ambiguity about what the correct behavior should be in a specific scenario. Finding that ambiguity before the model did was unambiguously worth it.

The Eval-Driven Development framework treats inter-rater disagreement as a first-class artifact, something to log, trend, and review at the same cadence as accuracy metrics. That posture has influenced how we structure our rubric review cycles.

Over-sample the adversarial tail relentlessly

The hardest cases in production are not randomly distributed. They cluster. Users who are frustrated tend to phrase requests in unusual ways. Edge cases in your ontology attract certain user populations. Holiday traffic patterns expose latency cliffs that normal sampling never sees. A uniform random sample will under-represent every one of these.

Our eval set construction deliberately over-samples from four buckets: cases where the model's confidence score was in the bottom quartile; cases where a human reviewer submitted a correction; cases that are syntactically adversarial (unusual punctuation, code-switching between languages, truncated input); and cases that previously caused a production incident, even if we fixed the root cause. The last bucket is the one teams most often skip because the incident feels resolved. It is not resolved until a future model version passes those exact cases on a held-out eval.

Over-sampling the tail is a tradeoff: your aggregate accuracy metric will look worse than if you used a uniform sample. That is a feature, not a bug. A metric that reflects your hardest real-world traffic is more honest than one that reflects your average traffic. Ship the model that passes the hard set, not the model that has the prettiest aggregate number.

The metric worth reporting to the business is not aggregate accuracy. It is model-versus-trusted-human disagreement on a frozen production-sampled set, tracked over time.

The one metric worth reporting to the business

Leadership wants a number. That is legitimate. The question is which number tells the truth.

Aggregate accuracy on your eval set is affected by your sampling strategy, your rubric, your rater pool, and the model, four variables at once. When the number moves, you often cannot say which variable moved it. That makes it a poor basis for a go/no-go decision.

The metric I report instead: model-vs.-trusted-human disagreement on the frozen production-sampled set, tracked over time. Specifically: for each case in the frozen set, a panel of calibrated senior human raters produces a reference answer under the blinded protocol. The model's output is scored against that reference by a second panel of raters. The disagreement rate, the fraction of cases where the model and the human panel diverged beyond a tolerance threshold, is the number that goes in the weekly report.

This metric has properties that make it worth tracking. It is anchored to a fixed distribution (the frozen set), so changes in the number reflect changes in the model, not changes in the test. It is anchored to human performance, so it has a meaningful zero and a meaningful ceiling. And it is directional: if it goes up, something got worse; if it goes down, something got better, and you can investigate exactly which cases changed.

When a team asks me how they know whether their model is ready to ship, the answer is this metric, below a threshold that the product team and engineering agree on before the eval runs, not after. Setting the threshold after you see the result is not evaluation; it is rationalization.

Running the harness: what it actually looks like

Here is an abbreviated output from our eval runner against a frozen set. The numbers are realistic but not from a specific live system.

# eval run against frozen set eval-set-2026-w23-v1.jsonl

python -m devlyn_eval.runner \

--suite eval-set-2026-w23-v1.jsonl \

--model prod-candidate-2026-06-14 \

--rater-pool senior-3

# results summary

cases evaluated 847

recall@1 0.871 # up 0.009 vs prior candidate

recall@3 0.934

p95 latency 1,840 ms # +120 ms vs baseline, flag for review

human disagree 6.2% # threshold 8.0%, PASS

adversarial tail 14.1% # threshold 18.0%, PASS

inter-rater agree 81.3% # floor 75.0%, PASS

verdict GATE CLEAR # deploy gated on p95 review

A few things worth noting in that output. The p95 latency flag does not block the deploy gate, but it surfaces for a mandatory human review before the deploy proceeds. We have shipped models with higher latency when the product team accepted the tradeoff explicitly; we have also pulled deploys at this stage when the latency increase turned out to trace back to an infrastructure change that nobody had caught. The flag earns its place by making the tradeoff visible rather than implicit.

The adversarial tail number is always higher than the aggregate disagreement rate. That is expected, those cases are harder. The question is whether it is improving over model iterations, and by how much. A model that improves aggregate accuracy while holding adversarial tail disagreement flat has not actually gotten better where it matters.

Senior engineers own this, not a tooling team

The failure mode I see most often in mid-sized teams is treating the eval harness as infrastructure, something a platform team owns, something that runs in CI and produces a number that engineers passively consume. That posture produces eval suites that measure the wrong things with great precision.

At Devlyn, the engineers who own a model's behavior in production own the eval suite for that behavior. They write the rubric. They sit in on rater calibration sessions. They review the disagreement reports. They decide when a rubric needs revision and they do the revision. This is not optional work that happens when there is time. It is the work. Shipping a model without understanding the eval suite that gated it is the same as shipping code without understanding the tests.

That stance does not scale if your engineers are using eval infrastructure that requires a PhD to modify. It scales when the infrastructure is legible enough that a senior engineer can trace any metric back to the rubric choices and sampling decisions that produced it. Legibility is an engineering requirement, not a nice-to-have.

The broader argument, that Human in the Loop Is Not a Plan, applies here directly: you cannot outsource production judgment to a human review queue and call it a quality system. The model either meets the bar on your frozen, adversarially-sampled, human-calibrated eval set, or it does not ship. Full stop.

What this does not solve

No eval harness predicts every production failure. Distribution shift will always eventually break a frozen set, the world changes, user behavior changes, and a set sampled in one quarter may not represent the traffic you see two quarters later. The answer is a regular cadence of set refreshes, versioned the same way code releases are versioned, with a deliberate overlap period where you run both the old set and the new set and compare results. Pairing that cadence with live production observability is how you notice the drift before the next refresh, rather than after.

Latent failures, cases where the model produces a confident, plausible, incorrect answer, are also harder to catch. A strong recall metric will not surface them if your reference answers are wrong. This is why calibrated human raters matter more than automated scoring at the margin: a rater who knows the domain will catch the plausible-but-wrong case that an LLM-as-judge might wave through.

And evals do not tell you whether you are building the right thing. A model that perfectly answers the questions users are asking can still be failing at the product goal if those questions are the wrong questions. That is a product problem, not an eval problem, but it is worth naming so that a green eval result does not produce false confidence about product-market fit.

What the harness described here does provide: a reliable, operator-grade gate between a model candidate and a production deploy. It will not catch everything. It will catch most of the things that kill launches, and it will catch them before your users do. At Devlyn, that is the bar we hold and we do not lower it to ship faster. The cost of a bad launch in an AI-Native product is not one incident, it is the erosion of the trust that makes the entire product possible.

Build the suite that predicts production. Freeze it. Trust the number it gives you, not the one you wished it gave you.

The CRO's case for shipping smaller models

Alpesh Nakrani — Fri, 29 May 2026 18:30:00 GMT

Revenue rarely rewards the biggest model. It rewards the one you can afford to run, ship, and explain to a customer.

I have been in enough board rooms and enough inference billing conversations to tell you the uncomfortable truth about the AI hype cycle: the companies winning margin on AI-native products are not running frontier models on every request. They are running the smallest model that gets the job done acceptably, on a pipeline designed to escalate only when the small model cannot handle it. The frontier model gets the press release. The small model pays the rent.

This is not a contrarian take for its own sake. I have lived it at Devlyn, where we have built customer-facing AI into a retail experience that touches real people in stores, trying on eyewear, making a considered purchase. Every latency millisecond matters. Every token costs money. Every time a model says something wrong, a human employee has to fix it in front of a customer who just wanted help picking frames. The pressure to ship something correct, fast, and cheap is not theoretical. It is daily.

So when I see teams doing their AI strategy by ranking models on a benchmark leaderboard and picking the highest number, I know exactly where that road leads. It leads to a product that is technically impressive, operationally unsustainable, and commercially marginal. The leap from a research demo to a gross-margin-positive product almost always runs through a smaller model than you started with.

Key takeaway: The smallest model that clears the bar for your task usually wins on margin, latency, and control, not the highest score on a general benchmark.
Smallness is fit, not weakness. The right size covers the task distribution you actually ship against, fits your latency budget, runs in memory you control, and stays explainable.
Narrow the task before you grow the model. Decomposing a vague task into specific subtasks beats model-shopping and routinely cuts cost with no drop in user-facing quality.
Route small first, escalate rarely. A cascade runs every request through the small model and pays frontier prices only for the genuinely hard tail.
"Good enough" is an eval result, not a vibe. Defensible model selection rests on an eval suite that reports errors by failure mode and severity, plus the cost delta.

The frontier gets headlines; the small model gets the margin

Let me start with the arithmetic, because strategy without numbers is just opinion. As of mid-2025, a top-tier frontier model through a major inference API typically costs somewhere in the range of $10-$30 per million output tokens. A capable mid-size model, something in the 7B-20B parameter range, either hosted or self-deployed, costs an order of magnitude less. A fine-tuned, task-specific small model running on your own hardware or a dedicated endpoint can come in at 5-20x cheaper than that.

// Rough unit economics sketch, illustrative, not exact frontier_cost_per_call = $0.04 // ~2k tokens in+out at $15/M mid_model_cost_per_call = $0.004 // same volume, ~$2/M hosted small_model_cost_per_call= $0.0005 // fine-tuned, self-hosted endpoint // At 500k calls/month: frontier_monthly = $20,000 mid_model_monthly = $2,000 small_model_monthly = $250 // Gross margin impact at $0.10 ARPU per call: frontier_gm = ($50k - $20k) / $50k = 60% small_model_gm = ($50k - $250) / $50k = 99.5%

That arithmetic is not subtle. You can quibble with specific numbers, but the order-of-magnitude differences are real and durable. The gap between a frontier model and a well-deployed small model on a narrow task is not the gap between good and mediocre. It is frequently the gap between a company that can raise a Series B and a company that cannot explain its unit economics to an investor.

This is what I mean when I say the frontier gets headlines and the small model gets the margin. GPT-4-class performance on a general benchmark does not tell you anything useful about whether a model can classify your customer's lens prescription query correctly 97% of the time at $0.0003 per call. A well-trained specialist beats a brilliant generalist on a narrow task, every time, once you factor in the full operational picture. The short version is: generality is expensive, and most production workloads do not need generality.

Smallness is not about parameter count, it is about fit

I want to be precise about what "small model" means, because teams often conflate it with "cheap model" or "dumb model" and then use that conflation to justify staying on the frontier. Smallness is a fitness concept, not a quality ceiling.

A model is the right size when:

It covers the task distribution you actually ship against, not some imagined worst case that occurs 0.1% of the time. If your use case is extracting structured data from customer intake forms, a 7B fine-tuned model will outperform a 70B general model because the fine-tuned model has seen thousands of your specific form variations and learned the extraction schema cold. The general model is trying to solve a harder problem than the one you have.

It fits within your latency budget. Users in a retail environment will tolerate roughly 1.5 to 2 seconds for an AI response before it starts feeling broken. A frontier model, even well-hosted, frequently cannot hit that wall on complex prompts, especially with long context. A small model running on an endpoint with sub-200ms time-to-first-token gives you room to build a real UX. Latency is a product quality metric, not just an engineering metric, and it has direct revenue consequences in conversion-rate-sensitive environments.

It fits in memory you can control. Self-hosted small models, quantized to 4-bit or 8-bit, can run on commodity GPU hardware you own or rent dedicated. That means no rate limits from a shared API, no surprise pricing changes, no service terms that change overnight. For a company building a product on top of AI inference, control over the stack is a competitive moat. Dependency on a single frontier API is a vendor risk you are carrying at high cost.

It is explainable. This one gets underweighted constantly. When a customer in a Devlyn store gets a recommendation they do not trust, the employee needs to be able to explain why the system said what it said. "Our AI analyzed your face shape, lighting conditions, and stated preferences and ranked these frames" is an explanation. "A 200-billion-parameter model predicted this token sequence" is not. Smaller, task-specific models with tighter prompt scaffolding tend to produce outputs that are easier to trace back to inputs. That traceability matters to customers who have been burned before, and in 2025, more of them have been burned than have not.

The leap from a research demo to a gross-margin-positive product almost always runs through a smaller model than you started with.

Task-narrowing beats model-shopping

The most common mistake I see teams make is treating model selection as the primary lever for quality improvement. They are unhappy with an output, so they swap to a bigger model. The bigger model costs more and is sometimes slower, but the output is marginally better. They declare victory. Then the next edge case surfaces and they shop for a bigger model again. Within six months they are on the largest available model, costs have doubled, and the quality problems are still there, because the problems were never about model capacity. They were about task definition.

Task-narrowing is the discipline of making the problem smaller and more specific before you make the model bigger. It is harder than model-shopping. It requires you to actually understand your task distribution, what inputs you will receive, what outputs you need, what failure modes you cannot tolerate. It requires labeling data and building evals and being honest with yourself about where the system is failing and why. But it consistently produces better results at lower cost than the next size up on the leaderboard.

The mechanics of task-narrowing look like this: you take a vague task, "help customers find the right eyewear", and decompose it into specific subtasks: classify intent, extract stated preferences, map to product attributes, rank candidates, generate explanation. Each subtask has a narrower input distribution and a clearer success criterion than the original. Each subtask can be modeled separately. And once you have separated the tasks, you discover that most of them are solved competently by models that are not frontier-class.

I have seen teams cut inference costs by 70% through task decomposition alone, with no change in user-facing quality. The frontier model gets retained for one or two subtasks where it genuinely earns its cost. Everything else runs on smaller, faster, cheaper models. In Defense of Small Models walks through this decomposition methodology in detail, it is one of the frameworks I return to whenever a team tells me they "need GPT-4" for something.

Cascade and routing: small first, escalate only when needed

Task decomposition leads naturally to the question of routing. Once you accept that different subtasks warrant different models, you need an architecture for deciding which model handles which request. This is model routing, and it is one of the highest-leverage infrastructure decisions a team can make.

The simplest version of routing is a cascade: run every request through the small model first, check the output against a quality or confidence criterion, and escalate to a larger model only when the small model fails the check. Most requests never escalate. You pay frontier prices only for the tail of genuinely hard cases.

The criterion for escalation can be as simple as a confidence score from the small model, a length or complexity check on the input, or a lightweight classifier trained to predict when the small model will struggle. The more sophisticated version involves multiple tiers, a small local model, a mid-size hosted model, a frontier model, with escalation logic tuned to your task distribution. I have covered the design space for this in Model Routing; the key insight is that the routing logic itself is usually cheap and fast, and even crude routing rules capture most of the savings.

At Devlyn, we do not route on a single binary. We route on a combination of factors: the estimated complexity of the customer query, the confidence score from the initial classification pass, and the business context of the interaction (a customer who has been in the store for forty minutes and is close to a decision gets a different resource allocation than a first-touch browser). The result is that the overwhelming majority of interactions never touch our most expensive models. The customers who get the frontier model experience are the ones where it genuinely changes the outcome, and they are a small fraction of total volume.

This is what "outcomes over velocity" means in practice at Devlyn. We are not racing to deploy the biggest model first. We are building infrastructure that allocates the right resource to the right moment. That takes longer to design than plugging in an API key, but it produces a product that can scale without the economics getting worse as volume grows.

Proving "good enough" is an engineering discipline

The phrase "good enough" sounds like a concession. In an engineering context, it is a specification. You cannot decide whether a smaller model is good enough unless you have defined what "good" means and built the machinery to measure it.

Evals are the mechanism. A real eval suite for a production AI feature looks like: a held-out dataset of representative inputs, labeled with the outputs you want, with failure modes categorized by type and severity. You run every candidate model against the suite. You look at the distribution of errors, not just the headline accuracy number, because a model that fails 5% of the time on your most critical edge cases is categorically worse than a model that fails 8% of the time on low-stakes requests, even if the latter has a lower overall score.

This is not glamorous work. It requires domain expertise to label data correctly, it requires consensus on what failure modes matter most, and it requires the discipline to maintain and expand the suite as the task distribution shifts. Most teams skip it, then are surprised when the model they chose in a two-hour vibe-check session behaves badly in production.

The business reason to do this work is that it is the only defensible basis for a model selection decision. When your CEO asks why you are running a small model on the customer recommendation flow and not GPT-5, the answer cannot be "it seemed fine in testing." It has to be "we ran our eval suite on both models, here are the results by failure mode, here is the cost delta, and here is the confidence interval on quality." That is a conversation that moves at board level. Gut feeling does not.

Quantization deserves a section in that analysis. A model quantized from full 16-bit precision to 4-bit takes up roughly one-fourth the memory and runs meaningfully faster. On most tasks, the quality degradation is small enough to fall within the noise of your eval suite. But "most tasks" is not "all tasks", quantization tends to hurt most on tasks requiring careful numerical reasoning, long-context faithfulness, and fine-grained instruction following. Know your task before you quantize. Run the evals both ways. The savings are often real; the quality drop is sometimes real too. The only way to know is to measure.

Privacy, local inference, and the customer who was burned

There is a dimension of small models that does not show up in benchmark comparisons but shows up everywhere in enterprise sales: data sovereignty. When you run a small model on your own infrastructure, on-premises, in your VPC, on a device, customer data never leaves your perimeter. It does not transit a third-party API. It does not appear in training pipelines. It does not create contractual ambiguity about data residency.

In healthcare, in financial services, in retail environments with loyalty programs and rich customer profiles, this matters enormously. I have been in sales conversations where a procurement team killed a deal not because the product was wrong but because the inference architecture required sending customer data to a hosted API that the legal team would not approve. A local small model closed the deal.

There is also a quieter version of this issue: customers who have been burned by AI before are increasingly skeptical of what happens to their data. A customer who asks "does this thing remember what I told it last time I was in the store?" deserves an honest, clear answer. That answer is easier to give if the inference stack is under your control and not a black box sitting in someone else's cloud.

The companies winning margin on AI-native products are running the smallest model that gets the job done, on a pipeline designed to escalate only when the small model cannot handle it.

Local inference also enables offline-capable products, a retail kiosk that works when the store's WiFi is flaky, a field service tool that runs in a warehouse without reliable connectivity. These are not edge cases. They are real operational constraints in a large fraction of physical-world AI deployments. A small model you can ship to a device solves them. A frontier API does not.

The CRO's summary: build for the margin, not the demo

I want to close with the revenue frame, because that is the frame that ultimately determines whether any of this gets funded and sustained.

Every AI feature has a cost structure and a value structure. The cost structure includes inference costs, maintenance overhead, human-in-the-loop costs for errors, and latency penalties on conversion. The value structure includes incremental revenue per interaction, churn reduction, customer satisfaction lift, and employee productivity. Your job as an operator is to maximize the spread between those two curves, not to maximize the capability of the model you are using.

The frontier model maximizes the capability numerator but taxes the cost denominator heavily. In a narrow task, that tax rarely pays off. The value does not scale with model size once you are above the capability threshold for the task. Capability above threshold is waste. And waste compounds at scale in ways that can turn a good unit-economics story into a bad one as volume grows.

The smaller model, properly evaluated, properly fine-tuned, properly routed, hits the capability threshold at a fraction of the cost. The margin it generates is real and defensible. The latency it delivers is a product quality improvement. The explainability it allows is a sales and trust asset. And the control it provides over the inference stack is a competitive moat that gets harder to replicate as you accumulate operational experience running it.

None of this means you never use a frontier model. It means you treat the frontier model like the expensive specialist it is: you engage it for the cases where nothing else will do, and you build a system that contains those cases rather than defaulting to them. That system, the routing, the evals, the fine-tuning, the task decomposition, is the real AI product. The model is an input to it.

The companies that internalize this early will have gross margins in five years that companies still defaulting to frontier-everything will envy. The frontier will keep getting better. It will also keep getting expensive relative to what you actually need. The discipline to know the difference, to measure it, route around it, and build margin from it, is the CRO's case for smaller models. And it is the case I will keep making until the field catches up.

Frequently asked questions

Are smaller language models just worse than large ones?

No. Smallness is a question of fit, not a quality ceiling. On a narrow task with a well-understood input distribution, a fine-tuned small model frequently outperforms a much larger general model, because it has been trained on the exact problem you have rather than a harder, broader one. The capability you need is whatever clears the bar for the task; everything above that threshold is cost without value.

When should I still reach for a frontier model?

Treat the frontier model like the expensive specialist it is: engage it for the genuinely hard cases where nothing smaller will do, and build a system that contains those cases rather than defaulting to them. In a routed pipeline, those cases are typically a small fraction of total volume.

What is model routing, or a cascade?

A cascade runs every request through the small model first, checks the output against a confidence or quality criterion, and escalates to a larger model only when the small model fails the check. Most requests never escalate, so you pay frontier prices only for the tail of hard cases. The routing logic itself is usually cheap and fast, and even crude rules capture most of the savings.

If you are working through model selection, routing, and evals for an AI-native product and want help building the system rather than just picking a model, the Devlyn team works on exactly this.

LLM-as-a-Judge: When to Trust It

Alpesh Nakrani — Thu, 28 May 2026 18:30:00 GMT

LLM-as-a-judge is reliable for cheap, scaled, relative grading on a well-specified rubric. It is unreliable for absolute quality calls, novel cases, and anything where its own biases (position, verbosity, self-preference) contaminate the score. Use it to triage, not to gate. The judge sorts the pile; a human decides the cases that matter.

I reach for an LLM judge constantly, and I trust it about as far as I have measured it. That is the whole posture of this piece. An LLM judge is a fast, cheap instrument with a known error profile. Treat it like a smoke detector, not a fire marshal. It tells you where to look. It does not sign off on the building.

Use the judge to triage, not to gate. It sorts the pile. A human decides the cases that change a release, a contract, or a customer.

Key takeaways

LLM-as-a-judge is reliable for relative grading on narrow, checkable rubrics, and unreliable for absolute quality, novel cases, and anything its own biases touch.
Four named biases break it: position, verbosity, self-preference, and calibration drift. All four produce confident, well-formatted scores that hide on a dashboard.
Never judge with the same model family you are grading; self-preference adds a uniform tilt nothing else surfaces. For high-stakes launches, run a three-judge ensemble.
Validate against human labels with Cohen's kappa before you trust a verdict. On well-scoped tasks judges reach κ of 0.75 to 0.83; below ~0.6 the rubric is broken.
The pattern that works in production is hybrid: the judge triages every output, humans gate the high-stakes tail, and corrections feed the rubric back.

How an LLM judge actually works

LLM-as-a-judge means using one language model to grade the output of another against a rubric you write. You hand the judge a prompt, the candidate response, and a scoring instruction. It returns a verdict: a score, a pass/fail, or a preference between two answers. That is the entire mechanism. The power is that it scales human-style judgment to thousands of cases per minute at a fraction of a human reviewer's cost.

There are two common protocols, and the choice matters more than people expect. Pointwise scoring asks the judge to rate one response on an absolute scale. Pairwise comparison asks it to pick the better of two. Pairwise tends to track human preference more faithfully because a relative call is easier than calibrating to an abstract scale. But pairwise has its own hazard: a 2025 study on feedback protocols found pairwise preferences flip in roughly 35 percent of cases versus 9 percent for absolute scores, and pairwise judges are more easily fooled by distractor features a generator learns to exploit. Neither protocol is free. You pick the failure mode you can tolerate.

Here is the shape of a judge prompt I would actually ship. Note that it is narrow on purpose.

# Judge prompt: factual grounding, pairwise, rubric-bound

"You compare two answers to a support question against the SOURCE doc."

"Score ONLY factual grounding. Ignore length, tone, and style."

"A claim is grounded if SOURCE states it. Unstated claim = ungrounded."

"Output JSON: {winner: A|B|tie, ungrounded_claims: [...], reason: str}"

# Then: randomize A/B order per call. Strip model names.

The instruction does one thing. It scores grounding, not vibes. It names what to ignore. It forces a structured output you can parse and audit. And the harness randomizes order on every call, because the judge has a thumb on the scale you have not seen yet.

Notice what the prompt does not do. It does not ask "is this a good answer." Good is a word that means everything and measures nothing. The moment your rubric contains a judgment a smart human would hesitate on, the judge stops being an instrument and starts being an oracle, and oracles are exactly what you cannot audit. Every line in a judge prompt should map to something you could check by hand on a sample. If you cannot check it by hand, the judge cannot either; it will just hide the guess inside confident JSON.

Where an LLM judge is trustworthy

An LLM judge earns its keep in four conditions, and they share a theme: the rubric is mechanical and the answer is checkable.

Relative, not absolute. "Is A more grounded than B?" beats "Rate this 1 to 10." Comparison anchors the judge; absolute scores drift across runs.
Narrow, checkable rubrics. Format compliance, schema validity, "does the answer cite the retrieved passage," refusal detection. These have near-binary ground truth.
High volume, low stakes per call. Regression-testing 5,000 responses after a prompt change. No single verdict gates a deploy; the aggregate trend does.
Triage and routing. Flag the bottom decile for human review. Even a noisy judge that surfaces 80 percent of the bad cases saves the reviewer most of the pile.

That last use is where the economics land. A human reviewing every output is a cost that grows linearly with traffic and never stops. A judge that filters the queue down to the cases worth a human's time turns a linear cost into a fixed one. This is the same argument I make about why a human reviews it is not a plan: review has to scale with autonomy, and an unaided human does not.

The inverse of this list is just as useful. Do not trust a judge to certify medical, legal, or financial correctness, to rank creative quality, to make the final call on a safety refusal, or to score anything where the right answer is genuinely contested. Those are absolute-quality calls on high-stakes, often novel cases. They are exactly the conditions where the biases below do the most damage. In those domains a judge can still pre-sort the queue. It just cannot be the last signature.

The biases that break the judge

LLM judges fail in named, measurable ways. In 2025 and 2026, researchers documented these well enough that you have no excuse for being surprised. Reporting on bias benchmarks found frontier models exceeding 50 percent error rates on challenging bias tests. Here are the four that have cost me time.

Position bias. The judge favors the answer in slot A (or slot B) regardless of quality. A study across 15 judges and ~150,000 instances found this is systematic, not random, and varies by judge and task. Mitigation: randomize order, and run both orders, then count a win only if it survives the swap.
Verbosity bias. Longer answers score higher even when no more correct. The judge mistakes elaboration for quality. Mitigation: tell it to ignore length, and spot-check whether your scores correlate with token count.
Self-preference bias. A judge rates outputs from its own model family higher. The self-preference bias work traces this to perplexity: the judge prefers text that is familiar to it. Applied 2026 reporting puts the tilt at a uniform 10 to 25 percent, and nothing else you do will surface it. The cardinal rule: never use the same family as judge and candidate. For high-stakes launches, run a three-judge ensemble across families and aggregate by majority vote.
Calibration drift and novelty. On out-of-distribution cases the judge invents a standard. It has no anchor for an answer it has never seen scored, so it guesses with confidence.

The dangerous property is that all four produce confident, well-formatted verdicts. The judge does not flag its own uncertainty. A wrong score and a right score look identical on the dashboard. That is why an unvalidated judge is worse than no judge: it launders noise into a number people trust.

A wrong verdict and a right verdict look identical on the dashboard. An unvalidated judge launders noise into a number people trust.

How to validate the judge against human labels

You do not get to trust the judge until you have measured it against humans on your task. A judge is one instrument inside a larger harness, and it earns trust the same way every other eval does: against ground truth. If you are building that harness from scratch, start with my complete guide to LLM evaluation and the broader question of which metrics actually matter and which ones lie. The validation protocol below is not complicated, and skipping it is the single most common mistake I see.

Build a calibration set of a few hundred cases sampled from real traffic. Have trusted humans label them with the same rubric the judge uses. Then run the judge on the same set and compute agreement. Not raw accuracy, but Cohen's kappa, which corrects for the agreement you would get by chance. Raw "85 percent agreement" can be near-random if one label dominates. Kappa tells you the truth.

What counts as good? Recent work gives useful anchors. The Judge's Verdict benchmark measures judge capability through human-agreement kappa. In practice, applied 2025 and 2026 studies report substantial agreement in the 0.75 to 0.83 kappa range on well-scoped tasks: smart-home agent grading at κ = 0.83, patch evaluation at κ = 0.75. Those are tasks with tight rubrics and checkable answers. The rough 2026 consensus is that κ above 0.6 is acceptable for production and κ above 0.8 is strong. Below ~0.6, the judge is too noisy to gate anything; use it only to triage. Treat these as illustrative targets, not promises: your number depends on your rubric and your task.

# Validate before you trust. Re-run on every judge-prompt change.

judge_labels = run_judge(calibration_set)

human_labels = load_human_labels(calibration_set)

kappa = cohen_kappa(judge_labels, human_labels)

# kappa >= 0.75 -> gate candidate. 0.6--0.75 -> triage only.

# kappa < 0.6 -> rubric is broken. Fix the rubric, not the model.

Two disciplines make this hold. First, re-validate whenever you change the judge prompt, the judge model, or the task. A rubric edit can quietly tank your kappa. Second, when the judge and humans disagree, read the cases. Disagreement is a rubric signal, the same way I treat inter-rater disagreement in a production eval harness: it usually means the rubric is ambiguous, not that the humans are wrong.

The hybrid pattern: judge triages, humans gate the tail

The pattern I trust in production is not "judge or human." It is the judge handling volume and humans owning the cases that matter. Concretely:

The judge scores everything. Cheap, fast, on every output. It produces a score and a confidence proxy (margin between A and B, or self-reported certainty).
Clear passes ship. High-confidence, high-score, on-distribution cases clear automatically. This is most of your traffic.
Humans gate the tail. The bottom decile, the low-confidence cases, anything novel or high-stakes, and a random audit slice route to a human. The human's verdict is authoritative.
The tail feeds the rubric. Human corrections on the routed cases become new calibration labels. The judge gets re-validated against them. The loop tightens.

This is where engineering meets the P&L. The judge converts review cost from linear-in-traffic to roughly fixed, and it does so without pretending the model is trustworthy on its own. You ship faster because most of the queue clears automatically, and you sleep at night because the expensive mistakes still hit a human before they hit a customer. Evaluation is the scarce skill in the judgment economy, and the judge is leverage on it, not a replacement for it. The full protocol, from sampling and blinding to kappa thresholds and the routing rules, is what I lay out in A Field Guide to Evals.

The honest trade-off: the hybrid only works if the routing is right. Set the threshold too loose and bad outputs ship under a green score. Set it too tight and every case routes to a human, which is the bottleneck you built the judge to avoid. The routing threshold is itself a thing you have to tune and monitor, and it drifts as traffic changes. There is no set-and-forget version of this.

Frequently asked questions

Is LLM-as-a-judge reliable?

It is reliable for relative grading on narrow, checkable rubrics, and unreliable for absolute quality calls or novel cases. Validate it against human labels using Cohen's kappa before you trust any verdict. On well-scoped tasks, judges reach κ in the 0.75 to 0.83 range; below ~0.6 they are too noisy to gate anything.

Can an LLM evaluate another LLM fairly?

Only with guardrails. Using an LLM to evaluate an LLM introduces self-preference bias: a judge rates its own model family higher because that text is familiar to it. Judge with a different model family than the one you are grading, randomize answer order to kill position bias, and tell the judge to ignore length.

What biases affect LLM grading?

Four named ones: position bias (favoring an answer by its slot), verbosity bias (rewarding length over correctness), self-preference bias (preferring its own family's text), and calibration drift on out-of-distribution cases. All four produce confident, well-formatted scores, so they hide on a dashboard unless you measure for them.

Should I use pairwise or pointwise scoring?

Pairwise comparison tracks human preference more faithfully and avoids scale drift, but flips more often (~35 percent of cases) and is more exploitable by distractor features. Pointwise absolute scoring is more robust to manipulation but drifts across runs. Pick the failure mode you can tolerate for your task, and validate either way.

If you are wiring an LLM judge into a real pipeline, the part that pays off is the instrumentation around it: the kappa checks, the routing thresholds, the audit slice, the drift alarms. That is the AI observability and monitoring work a Devlyn pod builds in from day one, so the judge stays honest as your traffic moves. Start by measuring your judge against humans. The number you get back will tell you whether you are triaging or gating.

RAG Evaluation: Measuring Retrieval Before It Collapses

Alpesh Nakrani — Wed, 27 May 2026 18:30:00 GMT

RAG evaluation works when you measure retrieval and generation separately. Recall@k and context precision on a frozen golden set catch the silent recall decay that end-to-end answer-quality metrics hide. A single "is the answer good?" score blends two systems into one number, so when retrieval rots you see a small wobble in answer quality and miss the cause entirely. Build the golden set before you ship. After the corpus drifts, you no longer have a clean reference to measure against.

I have watched this exact failure at more than one company. The demo retrieves perfectly, answer quality scores 0.9, everyone ships. Three months later the support tickets climb and nobody can say why, because the only metric on the dashboard moved two points. This piece is the retrieval discipline I trust, and where it bridges the broader evals that predict production practice.

A single answer-quality score is two systems wearing one number. When retrieval rots, that number barely flinches.

Key takeaways

If you read nothing else, these are the load-bearing claims:

Evaluate retrieval and generation as two separate systems. Recall@k and context precision measure the retriever; faithfulness and answer relevance measure the generator.
Recall@k is the early-warning metric. It decays first and silently, weeks before answer quality visibly drops.
Build a frozen golden set before launch. Sample real queries, label the relevant chunks, version it, and never edit it to chase a number.
Faithfulness catches the confident wrong answer. It is the fraction of answer claims supported by the retrieved context.
Recall decay is a revenue leak. Every query that should resolve and does not becomes a ticket, a refund, or a churned account.

Why end-to-end RAG evaluation hides the failure

Most teams evaluate a RAG pipeline the way they evaluate a chatbot: ask a question, grade the final answer. That is end-to-end evaluation, and it is necessary but not sufficient. A RAG system has two stages that fail for different reasons. The retriever fetches chunks. The generator writes an answer from those chunks. Grade only the output and you cannot tell which stage broke.

The reason this matters is that the two stages decay on different clocks. Generation quality is mostly stable, because the model does not change unless you change it. Retrieval quality erodes constantly, because the corpus grows, the query distribution drifts, and the embedding space gets crowded. I covered the full arc of this in why most RAG pipelines fail in month three. The short version: recall is the metric that rots, and end-to-end scores are too far downstream to see it early.

Here is the mechanism. When recall drops from 0.9 to 0.75, a good generator papers over the gap. It still writes a fluent answer from the weaker context, and for a while that answer is right often enough that your answer-quality metric reads 0.88 instead of 0.90. Two points. Inside a noisy metric, two points is invisible. Meanwhile a quarter of your queries are now answered from incomplete context, and the worst of them are confidently wrong.

The retrieval metrics that matter: recall@k and context precision

Retrieval evaluation needs two metrics working together. Recall@k asks: of the chunks that should have been retrieved for this query, how many showed up in the top k? Context precision asks: of the chunks that were retrieved, how many are actually relevant? Recall is about misses. Precision is about noise. You need both, because a retriever can be tuned to win one at the cost of the other.

Recall@k is the one I watch first. Recall is what collapses silently, and it collapses before anything downstream reacts. If recall@5 was 0.91 at launch and reads 0.72 this week, retrieval is failing and the only question left is how many users have already felt it. Context precision matters because a retriever that floods the context window with marginally relevant chunks degrades generation and inflates cost, even when recall looks fine.

These map to the standard open-source definitions. RAGAS scores context precision and context recall as core retrieval metrics, and is explicit that retrieval and generation should be measured on separate axes (Ragas metrics docs). The point is not the tool. The point is the separation. Whatever you use, keep the retriever score and the generator score in different columns.

RAG evaluation metrics at a glance

Here is the panel I keep, split by the axis it measures. Retrieval metrics tell you whether the right context arrived. Generation metrics tell you what the model did with it. Read them in that order, because a generation score is only meaningful once you know retrieval was sound.

Metric	Axis	What it catches	What it hides
recall@k	Retrieval	Missed chunks the answer needed	Nothing about answer wording; this is the early-warning metric
context precision	Retrieval	Noise and irrelevant chunks crowding the window	Misses, if recall is not watched alongside it
MRR	Retrieval	The right chunk ranked too low to use	Whether other relevant chunks were found at all
faithfulness / groundedness	Generation	Claims unsupported by the retrieved context	A grounded answer built on the wrong context
answer relevance	Generation	Answers that drift off the question	Whether the answer is actually correct

The "what it hides" column is the one most write-ups skip. Every metric here can read green while another reads red, which is the whole reason you score them separately and gate on the weakest, not the average.

Build the golden set before you ship

A golden set is a frozen, labeled collection of queries paired with the chunks that should be retrieved to answer them. It is the reference that makes recall@k computable. Without it you have no ground truth, and "did we retrieve the right thing?" becomes an opinion instead of a number.

Build it before launch, for one blunt reason. After the corpus drifts you cannot reconstruct what "correct retrieval" looked like at launch. The labels you make in month three are contaminated by the system's current behavior, so you end up grading the retriever against itself. A golden set built on day one is a fixed ruler. Here is the build, four steps:

Sample real queries. Pull 150 to 300 from production logs or beta traffic, weighted toward the queries that matter to revenue, not the easy ones.
Label the relevant chunks. For each query, a human marks which corpus chunks genuinely answer it. This is the expensive step and the one you cannot skip.
Freeze and version it. Save it as a named artifact, golden-set-2026-w24-v1.jsonl, and treat edits as a new version, never an in-place fix.
Schedule a refresh. The set drifts from reality over time, so re-sample on a cadence and version each refresh like a code release.

The full discipline for sampling and label-blinding lives in my book A Field Guide to Evals, and the RAG-specific version of it in RAG That Survives Contact. The honest trade-off: labeling a golden set costs real human hours, and people resist spending them before a launch. They spend far more hours later, in support and incident review, when retrieval fails and there is no reference to debug against.

A recall@k eval log you can actually read

Here is what a retrieval evaluation run looks like against a frozen golden set. The retriever score and generation score sit in separate blocks, on purpose. The numbers are realistic, not from a specific live system.

# retrieval eval against frozen golden set

python -m rageval.retrieval \

--golden golden-set-2026-w24-v1.jsonl \

--index prod-corpus-2026-06-15 \

--k 5

# retrieval metrics (the early-warning panel)

recall@5 0.74 # launch 0.91, threshold 0.85, FAIL

recall@10 0.86 # launch 0.96, drifting

context precision 0.81 # threshold 0.80, PASS

mrr 0.68 # rank of first relevant chunk slipping

# generation metrics (scored separately)

faithfulness 0.88 # threshold 0.90, FLAG

answer relevance 0.89 # looks fine in isolation, MISLEADING

verdict BLOCK # recall@5 below gate; retrieval, not the model

Read that log the way the gate reads it. Answer relevance is 0.89, which on an end-to-end dashboard would clear the bar and ship. But recall@5 has fallen to 0.74 against a launch baseline of 0.91. The retriever is missing a quarter of the chunks it should find, and the generator is covering for it well enough that the output still scores fine. The separated panel catches what the blended score would have waved through.

Recall@5 fell from 0.91 to 0.74. Answer relevance stayed at 0.89. Only one of those two numbers is telling you the truth.

Faithfulness and groundedness: grading the generator

Once retrieval is measured, grade generation on its own axis. Faithfulness, sometimes called groundedness, measures whether every claim in the answer can be inferred from the retrieved context. RAGAS defines it as the number of answer claims supported by the context divided by the total claims in the answer (Ragas faithfulness docs). A faithfulness of 0.88 means roughly one claim in eight is unsupported by what was retrieved. That is your hallucination rate, expressed as a number you can gate on.

Faithfulness and recall interact in a way worth naming. When recall drops, faithfulness often holds steady on its own terms, because the generator is being faithful to weak context. The answer is grounded in what it retrieved; it just retrieved the wrong things. This is exactly why you cannot collapse the two. A faithful answer built on a recall failure is a confident, well-sourced, wrong answer. Those are the ones that cost you a customer.

Pair faithfulness with answer relevance, which checks whether the answer addresses the question rather than drifting. Neither is reliable when scored by a weak judge model; faithfulness scoring in 2026 needs a strong reference model behind it to detect contradiction. I work through judge reliability in detail in when to trust LLM-as-a-judge, and the broader metric panel in the LLM evaluation metrics that matter.

Recall decay is a revenue leak

Here is the part most retrieval evaluation write-ups skip. Recall is not an engineering vanity number. Every query that should resolve and does not is a business event. In a support deployment it becomes a ticket a human now handles, which is cost. In self-serve it becomes a user who did not find the answer and churned, which is lost revenue. In sales enablement it becomes a rep quoting something the system never surfaced, which is risk.

So the metric to put in front of the business is not recall@5 by itself. It is the unresolved-query rate that recall decay drives, priced. When recall@5 falls from 0.91 to 0.74, model that against the fraction of queries now answered from incomplete context, then against what each unanswered query costs you downstream. That sentence, "retrieval decay is costing us X per month in support load," is what gets a retrieval fix prioritized. The recall number alone gets a shrug.

This is also where evaluation stops being a one-off script and becomes production instrumentation. Recall decays continuously, so you measure it continuously, the same way you measure latency or error rate. The loop that keeps the gate honest is simple: every low-scoring production query gets fed back into the golden set, so the set you measure against grows toward the queries that actually break. Standing up that kind of monitored, gated retrieval pipeline is squarely what Devlyn's RAG and knowledge integration builds, with the golden set and the eval gate wired in from day one rather than bolted on after the first incident.

Where RAG evaluation still falls short

Even a clean retrieval-plus-generation panel has a ceiling, and I would rather name it than oversell the method. A golden set drifts from reality as the corpus and the queries change, so a set you never refresh slowly stops measuring your live system. Labeling relevance is partly subjective, especially for queries with several defensible answers, so context precision carries some irreducible noise. And faithfulness scoring inherits every blind spot of the judge model grading it.

None of this argues for the fallback everyone reaches for, which is having a human read every RAG answer before it ships. That does not scale, and I make the full case in why a human in the loop is not a plan. The answer is a separated metric panel that earns the right to gate a deploy, plus a human who designs the golden set and audits the gate. The machine does the retrieval and the generation. The human evaluates both, separately, and the metrics are how that judgment scales past the demo.

Frequently asked questions

What is RAG evaluation? RAG evaluation is the practice of measuring a retrieval-augmented generation system on two separate axes: retrieval quality (recall@k, context precision) and generation quality (faithfulness, answer relevance). Scoring them separately is the whole point, because the retriever and the generator fail for different reasons and on different timelines.

What metrics should I use to evaluate a RAG pipeline? For retrieval, use recall@k and context precision, plus MRR if rank order matters. For generation, use faithfulness (groundedness) and answer relevance. Watch recall@k first, because it decays earliest and most silently, weeks before answer quality visibly drops.

How do I build a golden set for RAG? Sample 150 to 300 real queries, have a human label which corpus chunks should answer each one, freeze it as a versioned artifact, and refresh it on a schedule. Build it before you ship, because after the corpus drifts you can no longer reconstruct what correct retrieval looked like at launch.

Why does recall matter more than answer quality early on? Because a good generator masks weak retrieval. When recall falls, the model still writes a fluent answer from incomplete context, so answer-quality scores barely move while a growing share of answers are confidently wrong. Recall@k is the early-warning metric; end-to-end scores lag it by weeks.

If you want the full harness this plugs into, including label-blinding and gate design, my book A Field Guide to Evals walks through it end to end, and RAG That Survives Contact covers the retrieval-specific version. If you would rather have a team stand up a gated RAG pipeline with the golden set and eval harness built in from day one, that is exactly what Devlyn's RAG and knowledge integration is for. Measure retrieval separately. Catch the decay before a customer does.

When doing is cheap, deciding is everything

Alpesh Nakrani — Tue, 26 May 2026 18:30:00 GMT

If generation costs approach zero, value migrates to whoever can tell good output from bad. What that does to a company.

There is a standard argument that runs through most AI coverage right now: automation drives down the cost of production, so the economy shifts toward creativity, toward things machines cannot replicate, toward the ineffable human. It is a reassuring story. I do not think it is the right one. Or rather, it is the right destination but wrong on what the scarce input actually is.

The scarce input is not creativity. It is judgment. The ability to tell good output from bad output, fast, in cases you have never seen before, at a scale no human reviewer workflow was designed for. Call it the judgment economy: the regime where generation is abundant and the binding constraint is whoever can evaluate that generation correctly.

I have spent the last several years sitting at the intersection of revenue, technology, and operations, first as CTO, then COO, now CRO at Devlyn, a company that operates hundreds of optical retail locations. We generate, review, and deploy a very large number of decisions every day: pricing decisions, inventory decisions, scheduling decisions, customer interaction decisions. As generation costs fell, first with software tooling, now with language models, I watched where the pressure landed. It landed on judgment, every time. Not on the people who produced the work. On the people who could tell, quickly and reliably, whether the work was right.

This essay is about what that shift actually means for how a company prices its output, organizes its talent, builds moats, and thinks about margin. It is not abstract. Every mechanism I describe I have seen play out in an operating company or can model from first principles of cost economics.

The arithmetic of zero-marginal-cost production

Start with a simple model. Call the cost of producing one unit of work C. For a long time, C was dominated by labor: a writer writes a paragraph, an analyst builds a model, an engineer implements a feature. As C falls, through better tooling, better models, better automation, the equilibrium quantity of production rises dramatically. This is basic microeconomics. When something gets cheaper, you get more of it.

But getting more output does not mean you can use more output. Every unit of output still has to be evaluated before it enters a downstream decision. And the cost of evaluation, call it R for review, does not fall at the same rate as C. R is bounded below by human cognition, by how fast a person can read, understand context, spot a flaw, and make a call. When C drops by 10x and R stays flat, review becomes the bottleneck. The entire production pipeline backs up at the evaluation step.

This is the core mechanism. It is not complicated. What is interesting is the second-order consequences it produces throughout an organization.

When marginal production cost falls toward zero, review cost does not follow. The bottleneck migrates to whoever can evaluate output fast and correctly. That person becomes the constraint on your output rate, and your pricing power follows them.

At Devlyn, I watched this happen in slow motion as we adopted AI-assisted workflows. We could generate draft communications, draft pricing recommendations, draft shift schedules, in seconds. The constraint shifted almost immediately to the people who could look at that output and say, with confidence: ship it, don't ship it, or here's what's wrong. The generators became abundant. The evaluators became scarce. We started paying closer attention to who our reliable evaluators were, and what made them reliable. That is when I started thinking seriously about the economics of judgment as a standalone input category.

What judgment actually is (and is not)

Judgment is not expertise, though expertise helps. It is not experience, though experience is one input. Judgment is the capacity to evaluate output correctly in cases that are partially novel, where the full context was not present in training, where the edge case was not in the playbook, where the right answer requires integrating multiple signals that are individually ambiguous.

This is why it is hard to automate. You can train a model on historical evaluations. But if the distribution of cases shifts, new market, new product, new regulatory environment, new customer segment, the model's evaluation accuracy degrades exactly when you need it most. Humans with good judgment do not degrade as fast, because they are reasoning from principles rather than pattern-matching on examples. The reasoning is slower, but it is more robust to distribution shift.

This is also why judgment is hard to hire for. You can screen for credentials. You can screen for experience. You cannot easily screen for the ability to reason correctly about a case you have never seen before. It requires a different kind of evaluation process, one that presents genuinely novel situations, not variants of familiar ones.

In the context of The Judgment Economy, the argument I work through is that the historical proxies we used for judgment, degrees, titles, tenure, credentials, are becoming noisier signals precisely as the value of the underlying thing increases. We need better ways to identify and price judgment, because the market for it is becoming the market for the highest-leverage input in production.

What this does to pricing

If you run a professional services firm, or a software company, or any business where output quality is what customers actually pay for, the zero-marginal-cost production environment changes your pricing model fundamentally.

The old model was: price on inputs. You charge by the hour, by the seat, by access to the people or tools that produce the work. This made sense when production was the scarce step. The hour of the senior consultant, or the license to the software, was the thing you were buying.

When production is cheap and evaluation is scarce, the input-based pricing model collapses. A customer can generate unlimited candidate outputs. What they cannot do is evaluate them reliably. What they are actually paying for is the certification, the confident, accountable assertion that a particular output is correct and safe to act on.

This means pricing should migrate toward outcomes. Not the deliverable itself, but the guarantee attached to the deliverable. The revenue model that survives a world of cheap generation is one where you are pricing your judgment, your accountability, your track record of being right in novel cases. Seat-based licensing and hourly billing are artifacts of a production-scarce world. They are being priced out of existence by the same forces that created the opportunity.

I explore this in more depth in Revenue, Re-Engineered: What a CRO sees that a CTO can't, but the short version is: the CRO's job in this environment is to find the layer of the value chain where judgment concentrates, price it accordingly, and make sure the contract reflects accountability rather than effort. Effort is a commodity. Accountability for outcomes is not.

What this does to org design and labor markets

The organizational consequence of cheap production and scarce evaluation is a specific and somewhat uncomfortable redistribution of leverage. People who can evaluate reliably gain leverage relative to people who can produce quickly. Senior people gain leverage relative to junior people, not because the work is harder, but because the evaluation step requires the judgment that tends to accumulate with experience.

At Devlyn, we have tried to make this explicit rather than leaving it implicit. The principle we operate on is: ownership over hours, outcomes over velocity. A senior engineer on our team is not valuable because they write more code per hour than a junior engineer. They are valuable because they own production readiness, they are the ones who can look at a system change and make a confident, accountable call about whether it is safe to ship. That capacity does not scale with headcount in the way that raw production capacity does.

This has real implications for how you structure teams. In a production-scarce world, you want to maximize throughput of the production step: more engineers, more writers, more analysts. In an evaluation-scarce world, you want to maximize the accuracy and speed of the review step. That often means fewer, better evaluators with clearly scoped accountability, not more reviewers doing redundant checks, but sharper evaluators making confident calls on a well-defined domain.

The labor market consequence is a bifurcation. The middle, people who are competent producers but not yet reliable evaluators, gets compressed. Their output is increasingly replaceable by generation tools. The top, people whose judgment is demonstrably reliable in novel situations, becomes more expensive, because the demand for that capacity is growing and the supply is not. The bottom, people learning to evaluate by doing production work, is still valuable, but only if the organization has a clear path from production to evaluation. If that path is blocked by automation, you lose the pipeline that creates future evaluators.

The middle of the labor market gets compressed not because those people lack talent, but because their output is newly replicable. The path from production to evaluation has to be actively preserved, or you hollow out your own future judgment supply.

This is one of the underappreciated risks of aggressive automation. If you automate the production step completely, you eliminate the apprenticeship path. Junior people learn to evaluate by first producing, by seeing what good output looks like from the inside, by making mistakes and understanding why they were mistakes. If the production step is done by a model, that learning pathway disappears. You have to find other ways to develop evaluative capacity, which is harder and less natural.

Where moats come from now

The classic technology moat is a switching cost or a network effect. Once enough people use your platform, it becomes harder to leave; the value of staying compounds. These moats still exist, but they are being compressed by the same forces that are compressing production costs. If the underlying output can be replicated cheaply by a competitor using the same generation tools you use, the switching cost drops. The moat has to be somewhere else.

In a judgment-scarce world, the moat is trust. Specifically, it is the accumulated track record of being right, in novel cases, over time, combined with the willingness to be accountable when you are wrong. This is harder to replicate than a production capability, because it is a function of history, of decisions made and observed, of calls that proved correct under pressure, of accountability that was honored when it was costly.

The price of responsibility becomes a moat in itself. A customer who needs a consequential decision certified, whether to deploy a system, whether to price an asset, whether to make an operational change, is not primarily looking for the lowest-cost generator of options. They are looking for the entity that will be accountable if the call is wrong. That entity charges a premium for taking that accountability on. And the premium is sustainable because the track record required to credibly offer that accountability takes years to build and cannot be faked.

This is a fundamentally different basis for competitive advantage than most technology companies have been building toward. It is closer to the basis that law firms, accounting firms, and insurance companies have operated on, where the product is not the deliverable but the warranty attached to the deliverable. In AI-Native, the framing I use is that the companies that will compound in this environment are not the ones with the best generation capability. They are the ones with the best-calibrated judgment, the most reliable track records, and the contract structures that reflect accountability for outcomes.

The CRO lens: margin migrates to the evaluation layer

From a revenue and margin perspective, the judgment economy has a specific and legible shape. Gross margin concentrates at the layer of the value chain where evaluation happens. This is true in professional services, in software, in operations. The production layer commoditizes. The evaluation layer, wherever it sits, retains margin.

The CRO's job is to find that layer and price it correctly. In practice, this means three things.

First, unbundle what you are selling. Most legacy pricing bundles access, production, and evaluation into a single fee. In a world where production is cheap, this bundle misprices everything. Customers who only need access pay too much. Customers who need evaluation pay too little. Unbundling lets you price each layer at its actual market value, which means pricing evaluation at a significant premium to production.

Second, make accountability explicit in the contract. If you are selling judgment, the contract should reflect that. This means warranties, guarantees, service levels tied to outcomes rather than effort. It also means being specific about what you are accountable for and what you are not. Vague accountability is worth nothing; specific, bounded accountability for a defined outcome is worth a lot.

Third, invest in your track record as a revenue asset. Every decision your organization makes, and is held accountable for, contributes to a corpus of demonstrated judgment. That corpus is an asset. It should be managed as an asset, tracked, analyzed, used to improve evaluation accuracy, and cited as evidence in customer conversations. The customer who is considering paying a premium for your judgment wants to see the evidence that the premium is warranted. Your track record is that evidence.

At Devlyn, we have started treating our operational decision history this way. The calls we made about inventory positioning, pricing adjustments, scheduling changes, the ones we made and owned, are a record of our judgment in action. When we can point to that record, we can price our judgment at a level that reflects its actual value. When we cannot, we are back to competing on production cost, which is a race to the bottom.

The practical consequence

The conclusion is not complicated, but it requires discipline to act on. If you are running a company, an operating unit, a product team, a revenue function, the question to ask is not "how do I produce more?" It is "how do I evaluate faster and more reliably, and how do I build the contract structures that let me charge for that?"

The companies that will compound in the next decade are not the ones that generate the most output. They are the ones that can certify output at scale, that have the evaluative capacity, the track record, and the contract structures to make that certification valuable to customers who need consequential decisions made correctly.

This is a different kind of organizational capability than most companies have been building toward. It is slower to develop, harder to replicate, and, precisely because of those properties, more durable as a basis for competitive advantage. The judgment economy does not reward velocity. It rewards accuracy, accountability, and the trust that accumulates from being both, consistently, in cases that matter.

That is what it means for deciding to be everything when doing is cheap. Not that decisions are harder, though they often are. But that the economic value of making them well, and being willing to own the outcome, has never been higher.

Frequently asked questions

What is the judgment economy? It is the regime that emerges when the cost of producing work falls toward zero. Generation becomes abundant, but every unit of output still has to be evaluated before it can be acted on, and the cost of evaluation does not fall at the same rate. Value migrates to whoever can tell good output from bad, fast, in cases they have never seen before.

Why is judgment scarce rather than creativity? Creativity is one of the things generation tools now produce cheaply. Judgment is the capacity to evaluate output correctly in partially novel cases, where the right answer requires integrating ambiguous signals from principles rather than pattern-matching on examples. That capacity is hard to automate, hard to hire for, and slow to build, which is exactly what makes it the binding constraint.

How should pricing change in a judgment economy? Pricing migrates away from inputs (hours, seats, access to production) and toward outcomes and accountability. When customers can generate unlimited candidate outputs, what they pay for is the confident, accountable assertion that a particular output is correct and safe to act on. If you are working out how to price judgment and accountability into a real revenue model, that is the kind of work my team does at Devlyn.

LLM Evaluation Tools Compared (From Production)

Alpesh Nakrani — Mon, 25 May 2026 18:30:00 GMT

The right LLM evaluation tool depends on which job you are doing: offline eval suites that gate a deploy, online monitoring that scores live traffic, or human-labeling workflows that produce ground truth. No single product does all three well, and most teams do not need a heavy platform. They need a thin evaluation layer they own, wired to one or two tools for the parts that are genuinely hard to build.

I write this as someone with nothing to sell you here. I do not ship an eval tool. Every vendor roundup you will find is written by a company that does, which is why they all conclude that you should buy a platform. That bias is the gap, and being honest about it is the only edge worth having. Here is the landscape as it actually splits in 2026, the named tools in each category, and where each one stops being worth the cost.

This piece sits under my complete guide to LLM evaluation and extends the argument in my essay on evals that predict production. Read those for the why. This one is about the what: which LLM evaluation tools earn a place in your stack.

No single eval tool does offline suites, online monitoring, and human labeling well. Buying one platform for all three means accepting that it is mediocre at two of them.

Key takeaways

There is no single best LLM evaluation tool, because evaluation is three jobs: offline CI gating, online monitoring, and human labeling. No product does all three well.
Build the thin layer that encodes your judgment, the golden set, the metric, and the threshold. Buy the heavy infrastructure, at-scale trace storage and in-request scoring.
Offline frameworks are mature open source: DeepEval, promptfoo, OpenAI Evals, and RAGAS gate a deploy from CI for free.
Online platforms (Braintrust, Langfuse, Arize Phoenix, LangSmith) earn their cost when you need to store and score millions of production traces.
Every roundup is written by a company that sells a tool, so every one says buy a platform. The neutral answer is to own your definition of good and rent the rest.

The three jobs eval tools actually do

Before you compare LLM eval tools, separate the jobs. The category is muddled on purpose, because a vendor that does one job well wants you to believe its product covers all three. It does not.

There are three distinct jobs, and they have different buyers, different cadences, and different failure modes:

Offline evaluation runs in CI against a frozen golden set. It gates the deploy. Cadence: every pull request. The output is a pass/fail and a regression diff.
Online evaluation and monitoring scores live production traffic. It catches drift after launch. Cadence: continuous. The output is a trace, a quality score, and an alert.
Human labeling produces the ground truth everything else calibrates against. Cadence: periodic. The output is a labeled dataset and an inter-rater agreement number.

One distinction collapses most confusion: evaluation, observability, and monitoring are not the same thing. Observability shows you traces. Monitoring alerts on them. Evaluation scores the output against a goal. A tool that gives you beautiful traces but no scoring has not evaluated anything. Match the tool to the job, not to the marketing.

Offline eval frameworks: the part you should mostly own

Offline eval frameworks are open-source libraries that run in your CI pipeline. This is the category where building your own thin layer pays off most, because the framework is just a test runner with model-graded checks. The named tools here are libraries, not platforms, and they are good.

DeepEval gives you a pytest-style harness with 50-plus metrics, including RAG-specific ones, so your evals look like unit tests and run in the same CI step (DeepEval docs). promptfoo is the lightweight choice for fast prompt iteration and red-teaming; OpenAI agreed to acquire it in March 2026, and the OSS repo stays MIT-licensed (Braintrust, promptfoo alternatives). OpenAI Evals scales to large suites and works well as the blocking CI gate. RAGAS owns retrieval metrics promptfoo lacks: context precision, context recall, faithfulness, and answer relevancy, scored per RAG stage. For raw model benchmarking across standardized tasks, EleutherAI's lm-evaluation-harness remains the reference (lm-evaluation-harness on GitHub).

The honest pattern that has emerged: each library owns a phase. promptfoo for iteration, OpenAI Evals or DeepEval for the CI gate, RAGAS for retrieval scoring. You can run all three because they target arbitrary providers and integrate with CI the same way. The trade-off is metric definitions drift between libraries, so calibrate them against one human-labeled set or the scores will not compare.

One 2026 development is worth a flag. promptfoo now sits inside OpenAI, and the repo stays MIT, but it raises a question that applies to any eval tool a model vendor owns: if the thing grading your output is built by a company that also sells a model, is the grade neutral? I am not saying it is rigged. I am saying neutrality is the whole point of an eval, and the safest place to hold your scoring logic is a repo you control. If you are deciding what to adopt for the long run, my work on AI observability and monitoring starts from that exact principle: own the judgment, rent the infrastructure.

Online platforms: where buying starts to make sense

Online evaluation and observability is where the platforms live, and where buying gets defensible. Storing and querying millions of production traces, running synchronous LLM-as-a-judge scoring inside the request lifecycle, and alerting on quality drift is real infrastructure. Building that from scratch is rarely the right call.

The serious options split by what they optimize for. Braintrust is strongest for enterprise teams that want self-hosted evaluation tied to release control, with a generous free tier (Braintrust, self-hosted evals). Langfuse is the open-source choice for teams who prioritize self-hosting and infrastructure control over evaluation depth. Arize Phoenix is open-source, OTel-native tracing with built-in LLM-as-a-judge and dataset management. LangSmith covers offline, online, and multi-turn evals in one place if you already live in the LangChain ecosystem. Helicone is proxy-based, so you get logging with a one-line integration but shallower scoring.

The de facto stack for engineering-led teams in 2026 is a CI library plus a production platform: DeepEval for the gate, Braintrust or Phoenix for traceability. That pairing is sensible. What is not sensible is buying the platform first and discovering it cannot express the one metric your feature actually fails on.

Human-labeling tools: the ground truth you cannot skip

Every automated score is only as trustworthy as the human labels it was calibrated against. This is the category teams underinvest in, then wonder why their LLM-as-a-judge correlates with nothing. You need a structured place for humans to apply a rubric, not a spreadsheet.

Label Studio leads the open-source side, built for structured human review of agent traces and LLM outputs with rubrics, spot checks, and escalation (Label Studio, LLM evaluation). On the enterprise side, Labelbox and SuperAnnotate bundle vetted annotator networks and consensus QA. The choice is mostly build-versus-buy on the labor, not the software: do you have reviewers in-house, or do you need a vendor's talent pool?

Here is the part that connects to revenue. Skipping human labeling does not save money; it defers the cost to an incident. An automated judge that drifts from human judgment will pass a broken model, that model ships, and a customer finds the failure before your dashboard does. The labeling step is cheap insurance against an expensive escape. A human reviews it is not a plan, but a calibrated rubric run periodically is.

LLM evaluation tools compared

This table maps the categories to representative tools and the honest limit of each. Use it to decide what to build and what to buy, not as a ranking. The best LLM evaluation platform is the one that fits the job you actually have.

Category	What it is for	Representative tools	Honest limit
Offline eval frameworks	CI-gated suites against a golden set; the deploy gate	DeepEval, promptfoo, OpenAI Evals, RAGAS	Metric definitions drift between libraries; you still own the golden set and the threshold
Model benchmarking	Same task suite across many models, identical scoring	EleutherAI lm-evaluation-harness	Measures generic capability, not your task; weak signal for product gating
Online eval and observability	Trace, score, and alert on live production traffic	Braintrust, Langfuse, Arize Phoenix, LangSmith, Helicone	Trace-heavy, scoring-light in some; lock-in risk; cost scales with traffic volume
Human labeling	Ground-truth labels and rubrics to calibrate judges	Label Studio, Labelbox, SuperAnnotate	Software is the easy part; labor cost and reviewer consistency are the hard part
Thin layer you own	Glue: golden-set storage, thresholds, the gate decision	Your repo, ~200 lines of code	You maintain it; but no vendor can express your failure mode better than you can

When to build your own vs buy a platform

The decision is not build-or-buy across the board. It is build the thin layer, buy the hard infrastructure. The thin layer is the part vendors quietly assume you will own anyway, and it is where your real evaluation logic lives.

Build it yourself when the job is: deciding what goes in the golden set, defining the metric that maps to your specific failure mode, setting the pass threshold, and owning the gate decision in CI. That is roughly 200 lines wrapping an open-source library, and it is production code: curated cases, automated scoring, regression alerting, and a queryable history that answers "is it better?" with a number. No platform knows your failure modes better than you do.

Buy the platform when the job is at-scale trace storage, synchronous in-request scoring, drift detection across millions of events, or a labeled-data labor pool you do not have. Re-implementing that is months of work for an undifferentiated result.

Build the thin layer that encodes your judgment. Buy the heavy infrastructure that is undifferentiated. The mistake is buying the platform and letting it own your definition of good.

The trade-off worth naming: a platform gives you velocity now and a migration tax later. Your eval definitions, golden sets, and judge prompts get expressed in the vendor's schema, and moving off costs real engineering time. That is a fine trade if you ship faster and the lock-in is priced in. It is a bad trade if you bought the platform to avoid thinking about what to measure, because then the tool quietly decides your quality bar for you. For the deeper version of this argument, see how to build an LLM evaluation framework, which covers the harness in detail.

The failure mode every tool roundup hides

Tool comparisons rank features. They almost never name the failure that kills eval programs in practice: the team adopts a platform, gets a green dashboard, and stops asking whether the dashboard measures the right distribution. The tool was never the problem. The golden set was unrepresentative and the judge was uncalibrated, and a prettier UI hid both.

I have watched this twice. A team buys an observability platform, wires up traces, and reports a quality score that trends nicely up and to the right. The score is real. It is also measuring synthetic happy-path cases, not the adversarial tail that actually breaks production. The tool did exactly what it was sold to do. The judgment about what to feed it never happened.

This is why the build-versus-buy line matters more than the tool choice. The thin layer you own is where you decide what good means: which cases go in the set, what threshold blocks a deploy, and how often you re-pull from production. Outsource that decision to a vendor's defaults and you have bought a confident number that nobody validated. For the sampling problem underneath this, my essay on evals that predict production is the deeper treatment, and the LLM-as-a-judge calibration question gets its own breakdown in when to trust the model grading the model.

The revenue framing is blunt. An eval tool that reports the wrong number is worse than no tool, because it converts a known unknown into a false sense of safety. You ship faster and you ship broken. The cheapest insurance is the part no platform sells you: a representative golden set and a judge calibrated against humans. Buy the infrastructure. Own the judgment.

Frequently asked questions

What are the best LLM evaluation tools in 2026?

There is no single best tool, because evaluation splits into three jobs. For offline CI gating, DeepEval, promptfoo, OpenAI Evals, and RAGAS are the strong open-source frameworks. For online monitoring, Braintrust, Langfuse, and Arize Phoenix lead. For human labeling, Label Studio is the open-source standard. Most teams combine a CI library with one production platform and own a thin layer that holds their golden set and thresholds.

Should I build my own LLM eval framework or buy a platform?

Build the thin layer that encodes your judgment: the golden set, the metric, the threshold, the gate. Buy the heavy infrastructure: at-scale trace storage, in-request scoring, and labeled-data labor. The thin layer is roughly 200 lines around an open-source library. The platform is months to rebuild. Owning the layer keeps your definition of good in your hands instead of a vendor's schema.

What is the difference between LLM evaluation, observability, and monitoring?

Evaluation scores output against a goal and produces a pass/fail. Observability shows you traces of what happened. Monitoring alerts when a tracked signal crosses a threshold. A tool can give you rich traces and still evaluate nothing. When comparing LLM eval tools, check that the product actually scores quality, not just that it logs requests.

Are open-source LLM evaluation frameworks good enough for production?

Yes, for the offline gate. DeepEval, promptfoo, OpenAI Evals, and RAGAS are mature, maintained, and widely adopted in production CI. Where open-source gets harder is at-scale online monitoring, where storage and synchronous scoring are real infrastructure. A common production setup pairs an open-source CI framework with either a self-hosted open platform like Langfuse or Phoenix, or a managed one if you would rather not run it.

If you want this wired into a real product with monitoring and a gate that holds at 3am, that is the work a Devlyn AI observability and monitoring engagement does: the thin layer you own, the platform you buy, calibrated against ground truth from day one. For the full discipline behind the tools, my book A Field Guide to Evals is the longer answer.

'A human reviews it' is not a plan

Alpesh Nakrani — Sun, 24 May 2026 18:30:00 GMT

Every AI rollout I have seen in the last two years has the same slide. It shows a flowchart with boxes for "model generates output," "human reviews output," and then a green arrow labeled "approved." The presenter clicks past it in under ten seconds. Nobody asks questions. It feels self-evidently responsible. Of course a human reviews it. What kind of reckless organization would skip that step?

The problem is that "a human reviews it" is a sentence fragment pretending to be a policy. It answers none of the questions that actually determine whether oversight does anything. Which human? Reviewing what, exactly, the raw output, the downstream effect, the user-facing rendering? Against what rubric? With what authority to reject, revise, or escalate? At what sampling rate? Within what latency window? What happens when that person is out sick, on a call, or has 400 items in their queue?

I have been thinking about this ever since we started running agentic workflows at Devlyn at real volume, thousands of interactions a day, not hundreds. "Human in the loop" as a phrase does not survive contact with a production system at that scale. What survives is a design. And most organizations have not designed anything. They have named a person and called it a process.

The queue collapse you do not see coming

Here is how it usually goes. The pilot looks great. Volume is low, the reviewer is a subject matter expert who is also deeply motivated because this is new and interesting. Approval latency is two hours. Quality is excellent. Leadership signs off on expansion.

Volume doubles. Then doubles again. The reviewer's queue grows from 20 items a day to 80 to 300. They start skimming. They stop reading the full output and start reading the first paragraph and the final sentence. They begin trusting the model more than they should, not because they are lazy but because the model has been right 97 times in a row and the alternative is staying at the office until 9 p.m. every night. The approval rate for genuinely bad outputs climbs from near zero to somewhere uncomfortable, but nobody notices because nobody is measuring it.

This is not a story about bad people. It is a story about a system that was never designed for the load it was asked to carry. The reviewer was a single point of failure, and the failure mode was invisible, not a crash, not an error log, just a slow drift toward rubber-stamping. The human in the loop became a human downstream of the loop, signing off on whatever the model decided.

The failure mode is not a crash. It is a slow drift toward rubber-stamping, a human who is technically reviewing and functionally not.

I have seen this happen in three different companies in the last eighteen months, in domains ranging from customer-facing copy to clinical documentation summaries to financial recommendations. The dynamics are identical. The timeline varies by volume, but the end state is the same: a reviewer who is now a liability, not a safeguard, because decisions made under their nominal oversight carry the institutional weight of human approval without the substance of human judgment.

The incomplete sentence problem

When I say "human in the loop is an incomplete sentence," I mean it structurally. A proper oversight design requires answers to at least six questions before it can be evaluated as a plan.

Who reviews? Not a job title, a specific person or rotation, with defined backup coverage. If the answer is "whoever is available," the answer is nobody.

What do they review? The raw model output, the post-processing result, the user-visible artifact, the logged metadata? Each has different failure modes. Reviewing only the output and not the downstream rendering has caused some of the most embarrassing AI incidents I know of.

Against what rubric? An expert reviewing without a rubric is not evaluating, they are pattern-matching to intuition. Intuition is valuable, but it cannot be transferred, calibrated, or measured. If two reviewers cannot agree on what "good" looks like for the same output, the process is producing noise.

With what authority? Can the reviewer reject? Modify? Escalate? If rejection means the item goes back into the queue and gets re-reviewed by the same person with no additional signal, rejection is a gesture, not a control.

At what sampling rate? 100% review is a specific operational bet that review adds more value than it costs. That is sometimes true and often not. Most organizations default to 100% because it sounds rigorous, not because they have done the math.

At what latency? A review process with a 72-hour turnaround in a customer-facing context is not oversight, it is post-hoc audit with extra steps. The action has already happened. The damage, if any, has already propagated.

I wrote more about this framing in "Human in the Loop Is Not a Plan", the core argument is that oversight is only meaningful when it is specified precisely enough to be falsifiable. If you cannot describe what a failure of your oversight process looks like, you have not designed an oversight process.

Risk-tiered review and the capacity math

The practical alternative to "a human reviews everything" is not "nothing gets reviewed." It is a tiered model based on output risk and model confidence, with explicit capacity math for each tier.

At Devlyn, we think about this in three buckets. High-stakes outputs, anything that affects a patient's treatment path, a prescription, or a financial commitment, get 100% human review with a defined rubric, a credentialed reviewer, and a documented decision trail. Mid-stakes outputs, scheduling changes, care-plan summaries, patient communications, get sampled review at a rate calibrated to our current confidence in the model's error rate. Low-stakes outputs, administrative confirmations, appointment reminders, internal workflow nudges, go with model confidence scoring and exception flagging only.

The capacity math looks roughly like this:

Tier 1 (100% review): 200 outputs/day × 4 min/review = 800 min/day = 2 FTE dedicated reviewers
Tier 2 (10% sample): 2,000 outputs/day × 10% × 3 min/review = 100 min/day = fractional reviewer
Tier 3 (exception only): 8,000 outputs/day × 0.5% exception rate × 5 min/review = 200 min/day

Total review load at 10,200 outputs/day: ~1,100 min/day, manageable with 3 reviewers
If all 10,200 went to Tier 1: 40,800 min/day, you would need 17 reviewers

The math is not the interesting part. The interesting part is that most organizations never do it. They start with "a human reviews it," watch the queue back up, quietly reduce the review burden by skimming or sampling without ever formally changing the policy, and end up with the worst of both worlds: the liability of claiming 100% review with the actual coverage of something much lower.

The formal tiering forces you to make the tradeoff explicit. You are stating, for the record, that you believe low-stakes outputs with high model confidence do not warrant line-by-line human review, and here is the evidence that belief is calibrated. That is a defensible position. "We reviewed everything" that is actually "one exhausted person approved everything in their queue" is not.

Rubrics, calibration, and the disagreement that exposes the gap

Even when you have the right reviewers, the right tier, and the right capacity, you can still have a non-functional oversight process if reviewers are not calibrated to each other.

Calibration is uncomfortable to talk about because it implies that expert judgment can be wrong, and most organizations would rather not surface that. But inter-rater reliability is a basic measurement. If you give the same 20 outputs to two experienced reviewers and they disagree on 8 of them, you have a rubric problem, not a personnel problem. The rubric, or its absence, is producing inconsistent results, which means the "human review" you are counting on is measuring something different person to person.

The right response to calibration failures is not to pick the reviewer whose judgments you prefer. It is to go back to the rubric, find the specific dimensions where disagreement is highest, make the criteria more explicit, and re-run the calibration. This is tedious. It takes longer than just approving things. But it is the only way to turn reviewer judgment into something that can be sampled, audited, and improved over time.

The "A Field Guide to Evals" framework I have found most useful separates evaluation into three layers: outcome metrics (did the thing we wanted to happen happen?), output quality metrics (was the model's output correct by the rubric?), and process metrics (did the review happen on time, at the right sampling rate, with documented rationale?). Most human-review processes only measure the third layer, was the form filled out?, and assume that implies the first two.

Sampling plus evals as the actual plan

At production volume, the sustainable alternative to 100% human review is a combination of automated evals and structured sampling. This is not a novel idea, it is how quality control works in manufacturing, clinical trials, software QA, and financial auditing. You do not manually inspect every widget. You define acceptance criteria, sample at a statistically meaningful rate, measure defect rates, and set thresholds that trigger escalation.

Applied to AI outputs, this means:

Automated evals run on 100% of outputs. These are not LLM-as-judge opinion scores (though those have their place in calibration). They are deterministic checks: does the output contain required fields? Does it fall within acceptable ranges? Does it contradict a known fact in the patient's record? Does it reference something the model should not have access to? These checks catch the obvious failures at zero marginal cost per additional output.

Structured human sampling runs on a rate calibrated to your error tolerance. If your automated evals are catching 90% of failure modes, and your manual sampling over 500 outputs shows a 1.2% undetected error rate, you know your current eval coverage and can make an informed decision about whether 1.2% is acceptable given output stakes. That is a risk management decision, not a philosophical one.

Exception escalation pulls specific outputs to mandatory human review. The automated evals that fire, the outputs that fall below model confidence thresholds, the outputs that touch high-risk categories, these go to a human with a defined SLA and a rubric. The human is reviewing because something specific triggered escalation, not because the queue happened to route to them.

Oversight is a system with defined inputs, outputs, and failure modes, not a person whose title implies responsibility.

This architecture scales. When volume doubles, your eval infrastructure scales horizontally. Your human reviewers, now working on a sampled and exception-driven basis, maintain consistent quality because they are not drowning. The sampling rate and exception thresholds become the levers you tune as the model improves or as you discover new failure modes.

I am not going to lay out the mechanics of building that eval and sampling layer here, that is its own discipline, and I wrote a separate, more procedural piece on how to actually run human-in-the-loop evaluation. This essay is about why the one-line version is not a plan in the first place.

Autonomy boundaries that scale with evaluation strength

There is a principle that I think is underappreciated in conversations about human oversight: the appropriate level of AI autonomy should scale with the strength of your evaluation system, not with the novelty of the technology or the comfort level of leadership.

This sounds obvious but runs counter to how most rollouts are actually paced. Organizations typically start with high autonomy in the pilot (when everything is new and being watched closely), then add human review as they scale (when volume makes close watching impossible), then quietly reduce human review as the queue collapses (when they have the most volume and the least oversight). This is exactly backwards.

The right sequence is: start with 100% human review while you build your eval suite. As your evals gain coverage and you can measure model performance reliably, expand the model's autonomy in proportion to your measurement confidence. Never expand autonomy beyond the reach of your evaluation system. If you cannot measure it, you cannot govern it, and you certainly cannot catch it when it fails.

At Devlyn, we use the phrase "senior engineers own production readiness" to mean something specific about this. An engineer who ships an agentic feature does not hand it off to a reviewer and walk away. They own the eval coverage for that feature. They define what a failure looks like, instrument the detection, and set the thresholds. The "human in the loop" for their feature is a designed system they built and are accountable for, not a colleague they handed the queue to. When the eval coverage is strong and the error rate is low, they earn the right to reduce human review. When something unexpected shows up in the metrics, they increase it. Autonomy is a function of evidence, not of time elapsed since launch.

This does not mean every engineer needs to be an eval expert. It means that production readiness for AI features includes the oversight design as a first-class artifact, specified, reviewed, and revisited on a defined cadence.

Designing oversight as a system

The reframe I keep coming back to is this: oversight is infrastructure, not reassurance. "A human reviews it" is reassurance. It sounds responsible. It satisfies the compliance checkbox. It makes the slide deck feel complete. But it does not function as infrastructure because it has no defined inputs, no acceptance criteria, no failure modes, no scalability model, and no measurement framework.

Oversight infrastructure asks and answers: what is the expected error rate of this model in this context? What is the acceptable error rate given the stakes of each output tier? What automated checks catch which failure categories? At what sampling rate does human review add marginal value over the automated baseline? What does the escalation path look like, who is on it, and what SLA are they held to? What happens when the error rate crosses a threshold, does autonomy contract automatically, or does someone have to make a call?

These questions are not especially hard to answer. They are just boring to answer. They require the kind of careful, operational thinking that does not make the slide deck feel exciting. Nobody has ever gotten a standing ovation for presenting their inter-rater reliability calibration protocol. But this is exactly the work that separates organizations that have genuinely safe AI deployments from organizations that have AI deployments with a human somewhere nearby who can be blamed if something goes wrong.

The reviewer who becomes a rubber stamp is not the problem. The rubber stamp was the design. What you built was a system where volume would eventually exceed capacity, where the rubric was implicit and not transferable, where the reviewer had no way to know what they were supposed to be catching, and where nothing measured whether catching was actually happening. The human was a prop in a process that was never specified well enough to function.

If your current AI oversight plan would stop functioning correctly if that one reviewer took a two-week vacation, it is not a plan. It is a dependency. Plans survive the individuals inside them. Oversight systems should too.

Start with the questions. Write down the answers. Make the rubric. Do the capacity math. Build the evals. Set the thresholds. Then put a human in the loop, but only in the specific, instrumented, scalable way that actually means something.

If you are trying to build oversight that survives production volume rather than a reviewer who quietly drowns in it, this is the kind of work my team does at Devlyn.

Frequently asked questions

What does "human in the loop" actually mean for AI systems? In practice it should mean a designed oversight system: a specific reviewer or rotation, a defined rubric, a sampling rate, an authority to reject or escalate, and a latency window. Most of the time it means none of that, just a person named on a slide. If you cannot describe what a failure of your oversight process looks like, you do not have human in the loop, you have a dependency.

Why does human-in-the-loop review break down at scale? Volume grows faster than reviewer capacity. The reviewer starts skimming, then trusts the model because it has been right many times in a row, and the process drifts toward rubber-stamping without anyone measuring it. The failure is invisible: no crash, no error log, just a human who is technically reviewing and functionally not.

What is the alternative to reviewing every AI output by hand? Risk-tiered review plus automated evals and structured sampling. Run deterministic checks on 100% of outputs, sample human review at a rate calibrated to your error tolerance, and route exceptions and high-stakes cases to a reviewer with a rubric and an SLA. Then scale model autonomy in proportion to how well you can measure it, never beyond the reach of your evaluation system.

Why most RAG pipelines fail in month three

Alpesh Nakrani — Sat, 23 May 2026 18:30:00 GMT

Let me tell you what month three looks like. You shipped a RAG pipeline in month one. The demo was clean, you typed in a question, the right chunk surfaced, the LLM answered coherently, your stakeholders nodded. Month two you connected it to the real data store and it still mostly worked. Month three something quietly broke. You can feel it in the support tickets. Users ask questions that should be answered. The system returns something plausible but wrong, or returns nothing useful at all. No alert fired. No error log. Recall just eroded, and nobody noticed until a customer noticed for you.

I have watched this happen at three different companies now. Different stacks, different domains, same arc. The demo retrieves perfectly. Then the corpus grows, the queries drift, and recall collapses silently. The pattern is consistent enough that I have stopped treating it as bad luck and started treating it as the default outcome of pipelines built without a retrieval discipline. This essay is about what creates that failure mode and what closes it.

RAG demos pass because they run against a frozen corpus and queries you already know. Production has neither.
Corpus drift is three problems, not one: coverage drift, staleness drift, and query drift. Each needs a different fix.
The failure stays invisible because teams monitor answer quality, not retrieval. A capable model papers over bad retrieval until recall has already collapsed.
Long context does not replace retrieval. It dilutes signal and shifts the burden to attention in a regime where models perform worse.
The closes are operational, not algorithmic: build a retrieval eval harness before you ship, sample production queries into it, audit chunking on new content types, run hybrid retrieval with a reranker, and schedule re-embedding.

The toy corpus problem

Every RAG pipeline I have seen starts its life against a frozen toy corpus. Maybe it is a hundred hand-picked documents. Maybe it is a year of historical support tickets that someone cleaned up and deduplicated. Either way, it has a property that production corpora almost never have: it is stable. Nothing is being added. Nothing is being modified. The distribution of topics does not shift week to week. You chunk it once, embed it once, index it once, and then you write your retrieval code against something that will never change underneath you.

The chunking strategy you picked, maybe 512-token windows with 10% overlap, performs fine on that corpus. The embedding model you chose generalizes well enough across the topics represented. You build your recall intuition on that foundation, which means you build it on something that will stop being true the moment you go to production.

At Devlyn, we work with optical retail networks. The product catalog is not static. Frames come in and out of stock. Vision plan coverage rules update when carriers renegotiate. New lens technology gets added to the lineup. Clinical protocols evolve. When we stood up our first RAG prototypes, we built them on catalog snapshots that were already thirty days stale. They worked beautifully on those snapshots. Then we pointed them at the live data pipeline and the problems started within weeks.

Corpus drift is not one problem, it is three

When I talk about corpus drift I mean three distinct phenomena that are easy to collapse into one word but require different responses.

The first is coverage drift: new content arrives that covers topics your index has never seen. Your embedding model may not have good representations for those topics. Your chunking may not handle the new document structure well. The new content exists in the index, but retrieval on queries about it underperforms because the model was never calibrated against it.

The second is staleness drift: old content becomes wrong. The document that used to correctly describe your return policy now describes a policy you changed six months ago. If you are not surfacing document age as a retrieval signal, you will happily return stale chunks with high cosine similarity to the query. The embedding does not know that the content is wrong, it only knows that it is semantically similar.

The third is query drift: your users stop asking what they were asking in month one. This one is subtle because nothing changes in your pipeline, but the match between your retrieval behavior and your actual workload degrades. If you tuned your chunk size and overlap against questions about product features, and your users start asking questions about installation, troubleshooting, and compatibility, your retrieval parameters may be wrong for the new query distribution even if the corpus itself has not changed.

Most teams discover all three of these together, which makes diagnosis confusing. The fix for coverage drift is different from the fix for query drift. You need to be able to separate them.

The demo worked because you optimized retrieval for a corpus that was never going to change, against queries you already knew. Production is neither of those things.

The silent collapse: no retrieval evals

Here is the mechanism by which all of this stays invisible for two months: there are no retrieval evals. I do not mean that teams are lazy. I mean that the evaluation infrastructure that would catch recall degradation is genuinely hard to build if you have not done it before, and most teams prioritize end-to-end answer quality instead because that is what users care about and what demos show.

The problem with evaluating only end-to-end answer quality is that a capable LLM can often paper over retrieval failures in the short term. If you retrieve three mediocre chunks, a good model will sometimes synthesize a plausible answer anyway. This creates a false signal: your answer quality metrics look acceptable, so you conclude retrieval is fine, so you do not build the retrieval-specific evals that would show you the truth. Meanwhile recall@5 is drifting from 0.82 in month one to 0.61 in month three and you have no chart that shows it.

# retrieval eval log, sampled weekly from production query logs # metric: recall@5 on labeled golden set (n=200 query/chunk pairs) week recall@5 p50_latency_ms corpus_size 01 0.84 142 4_211 05 0.81 155 6_034 09 0.76 178 8_917 13 0.61 203 12_440 # ← stakeholder escalation

That table is roughly what we saw at Devlyn when we finally built the retrieval eval harness. Recall was already at 0.61 before the first user escalation reached leadership. We had been looking at generation quality and latency the whole time. Both looked acceptable. Recall was bleeding out quietly in the background.

The reason retrieval evals get skipped is that they require a golden set: a collection of query-and-relevant-chunk pairs that you can measure recall against. Building that golden set requires effort, human annotation, or a carefully reviewed LLM-assisted annotation pass, or both. Most teams defer it to "after we ship" and then never find the time. I used to defer it too. I stopped after the third time a pipeline degraded silently on me.

Why long context is not the answer

When teams discover retrieval degradation, the first instinct is usually to retrieve more chunks and pass more context. If recall@5 is bad, why not retrieve twenty chunks? If twenty chunks does not fit in the context window, why not use a model with a longer context window? The problem looks like a window size problem, so the solution looks like a bigger window.

I want to be careful here because long context has real uses. But it is not a substitute for retrieval quality, and treating it as one creates its own failure modes.

The core issue is that long context is not memory. A model with a 128k context window does not attend uniformly to all 128k tokens. Empirically, models perform worse on tasks that require locating relevant information in the middle of a long context than at the beginning or end. Retrieval solves the problem of getting the right information in front of the model. Padding the context with more chunks does not solve that problem, it dilutes the signal-to-noise ratio and shifts the burden to the model's attention mechanism in a regime where it performs less reliably.

There is also a cost and latency argument. Retrieval that works means you can use a tight context window with high-precision chunks. Retrieval that does not work means you are sending large contexts to a large model on every query, paying for tokens that are mostly noise. The economics of a production RAG system depend more heavily on retrieval precision than most teams realize when they are in demo mode.

The right mental model is that retrieval and context are not substitutes, they are complements. You want retrieval to do its job precisely so that context can be used efficiently. If retrieval is failing, adding context window is treating a symptom, not the disease. I cover this in more depth in "Long Context Is Not Memory" if you want the full breakdown of where long context helps versus where it is covering for something else.

Chunking that stops working

One of the more counterintuitive failure modes is that your chunking strategy can be correct for your initial corpus and wrong for the corpus you end up with six months later. Chunking is not a one-time architectural decision, it is a choice that is implicitly coupled to your document structure, your embedding model's context window, and the granularity at which your queries operate.

Consider a knowledge base that starts as a set of product FAQs: short, self-contained documents where a single question and answer fit cleanly into a 512-token chunk. That structure works well. Then the company adds technical documentation, multi-page installation guides, API references, troubleshooting trees. Those documents have a different structure. The meaningful unit of information is not a 512-token window; it might be a section, or a procedure, or a set of related steps. Fixed-size chunking fragments them at arbitrary boundaries, and the resulting chunks lose the context that makes them useful.

The honest version of this is that chunking is a domain-specific problem. There is no universal chunk size. The right answer depends on what you are chunking, how your users query it, and what your embedding model does with chunks of that size. The "Embeddings, Honestly" chapter on chunk size has the detail on how embedding quality degrades at the extremes, but the operator takeaway is simpler: you need to audit your chunking strategy whenever your corpus structure changes significantly, and you need retrieval evals to tell you when it has stopped working.

At Devlyn we ended up with a hybrid chunking approach: fixed-size for structured product data where sections are short and uniform, document-structure-aware chunking for clinical and regulatory content where sections carry their own semantic unit. We did not plan for that in month one. We discovered it in month three when retrieval on clinical queries degraded while retrieval on product queries stayed stable. The signal came from the evals.

Embeddings go stale and nobody schedules the update

Embedding models are not static in their relationship to your corpus. Two things change over time that affect this relationship: the embedding model itself may be updated or replaced, and your corpus vocabulary may drift away from the distribution on which the model was trained.

The second one is more common and more insidious. If you are operating in a domain that has specialized terminology, medical, legal, financial, technical, and new terminology enters your corpus that was not well-represented in the model's training data, retrieval on queries using that terminology will underperform. The model has never seen the term in a way that would give it a useful embedding. Queries using the term match poorly even to documents that are directly about it.

Most teams address this by fine-tuning their embedding model on domain-specific data at the outset, which is good practice. What fewer teams do is schedule any kind of re-embedding cadence. The embedding index is treated as a build artifact, you generate it once, you ship it, and you maintain it only when something obviously breaks. That cadence is too slow for a corpus that is growing and changing continuously.

What I have landed on: quarterly re-embedding of the full corpus when the corpus is under fifty thousand chunks, more frequent for subsets of the corpus that are changing rapidly. This is operationally annoying but the cost is predictable and the alternative, recall degradation that compounds over time, is much more expensive to deal with reactively. The "Retrieval That Survives Contact" chapter on embedding lifecycle has a more detailed treatment of how to decide on re-embedding cadence based on your corpus change velocity.

You will not catch recall degradation by monitoring answer quality. You will catch it by running retrieval evals on production-sampled queries, on a cadence, against a golden set you built when the system was working.

How I actually close the gap

None of what I am describing requires exotic technology. The closes are operational, not algorithmic. They require discipline and scheduling, not new models.

First, build the retrieval eval harness before you ship. Not after. A golden set of two hundred query-chunk pairs, annotated by domain experts or carefully reviewed after LLM-assisted generation, is enough to give you a meaningful recall@5 signal. Run it weekly. Put it in your dashboards next to latency and error rate. The moment recall starts drifting, you want to know it.

Second, sample your production query log continuously and add new queries to your eval set. The queries you had in month one are not the queries you will have in month six. Your eval set needs to stay representative of the actual query distribution. Set a threshold, say, once a quarter, add fifty production queries to the golden set with fresh annotations. This keeps the eval honest as query drift happens.

Third, audit your chunking whenever you add a new content type. Before new document categories go into your index, spend time understanding their structure. Do a manual retrieval evaluation: take twenty representative queries for the new content type, run them against your current chunking, look at what comes back. If the chunks look fragmented or decontextualized, your chunking strategy needs to change for that content type.

Fourth, implement hybrid retrieval with a reranker. Pure dense retrieval, cosine similarity against embeddings, is a good baseline but degrades more sharply as your corpus grows and as vocabulary shifts. Hybrid retrieval, sparse keyword matching like BM25 combined with dense retrieval, with a reranker to reconcile the two signal sets, is more robust to corpus growth and vocabulary drift. This is also where agentic retrieval starts to earn its keep, letting the system decide how to query rather than firing a single fixed lookup. The reranker can be a cross-encoder or a lighter learned model; either way it provides a second pass that is less sensitive to the specific embedding model's blind spots.

Fifth, schedule re-embedding and make it a first-class operational task. Put it on a calendar. Treat it like a database migration: planned, reviewed, tested against your eval set before it goes to production. An embedding update that drops recall is a regression. Treat it like one.

The through-line is that production RAG requires ongoing retrieval operations, not just retrieval architecture. The architecture decisions matter, chunking strategy, embedding model choice, hybrid versus dense retrieval, but they degrade over time without the operational layer to maintain them. At Devlyn, owning production readiness means owning the full lifecycle: the initial architecture and the maintenance cadence and the eval infrastructure that tells us when something is slipping before a customer tells us first.

The demo retrieves perfectly because it was designed against a corpus that was never going to change, with queries you already knew. That is not a problem with the demo, demos are supposed to show the happy path. The problem is treating the demo as evidence that the system is production-ready. It is evidence that the architecture can work. Whether it actually works in six months depends on what you build around it.

Most teams build nothing around it. That is why month three looks the way it does. The retrieval eval is not glamorous work. It does not ship features. It does not impress stakeholders. It is the discipline that keeps everything else from quietly falling apart.

Frequently asked questions

Why do RAG pipelines fail in production? They are usually built and tuned against a frozen toy corpus with a known set of queries. In production the corpus grows and changes, the query distribution drifts, and retrieval quality erodes. Because most teams monitor end-to-end answer quality rather than retrieval, the degradation stays invisible until a customer hits it.

How do you detect retrieval degradation before users do? Build a retrieval eval harness with a golden set of query-and-relevant-chunk pairs and track a metric like recall@5 on a cadence, next to latency and error rate. Answer-quality metrics will not catch it, because a capable model can paper over bad retrieval for a while.

Is a longer context window a substitute for good retrieval? No. Models attend less reliably to information in the middle of a long context, so padding the window with more chunks dilutes the signal and raises cost and latency. Retrieval and context are complements: precise retrieval lets you use a tight, high-precision context efficiently.

How to Evaluate an AI Agent (Evals for Agents)

Alpesh Nakrani — Fri, 22 May 2026 18:30:00 GMT

Evaluating an AI agent means scoring the whole trajectory, not just the final answer. You measure tool-call correctness, step efficiency, recovery from a failed step, and outcome success against a frozen task set. Single-output evals ask "was the answer right?" Agent evals ask "did the system take a sane path to a verifiable end state, and would it do that again?"

This distinction is not academic. An agent can produce a correct-looking final message while having called the wrong tool, mutated the wrong record, and burned forty steps to get there. Score only the last message and you certify a system that is one edge case away from an incident. That is why "a human reviews it" does not scale to multi-step agents: a person can spot-check a paragraph, but no one is reading a 40-step trace on every run at production volume.

A single-output eval grades the destination. An agent eval grades the journey, the destination, and the cost of getting there.

Key takeaways

Before the detail, the claims this piece will defend:

Trajectory beats output. Score tool-call correctness, step count, and recovery, not only the final response.
Outcome and process are different metrics. Outcome asks "did the database reach the right state?" Process asks "was the path sane?" You need both.
Freeze the task set. An ai agent evals harness needs a fixed, versioned set of tasks with known goal states, sampled from real traffic.
Consistency is its own metric. Run each task many times. On public benchmarks, agents that pass once often fail when you demand they pass eight runs in a row.
Human-gate the tail. Automate the bulk of scoring; route the low-confidence and high-blast-radius slice to a senior reviewer.

Why agent eval differs from single-output eval

A single LLM call has one degree of freedom: the text it returns. You score that text against a reference or a rubric and you are done. An agent has many degrees of freedom. It chooses a tool, reads the result, updates its plan, chooses again, and loops until it decides it is finished. Each of those choices is a place the run can go wrong while still ending on a plausible sentence.

I covered the production failure modes in my honest accounting of what agents can do today: agents confidently pursue the wrong goal, hallucinate a tool output and proceed as if it were real, or take an irreversible action and then report success anyway. None of those failures show up if your eval reads only the final message. They show up in the trajectory.

So llm agent evaluation measures things a chat eval never has to. Did the agent pick the right tool for the step? Did it pass valid arguments? When a call failed, did it retry sensibly or spiral? How many steps did it take versus the known-good path? Did it stop, or did it loop until a timeout? The final answer is one signal among six or seven.

Trajectory and step metrics: what to actually measure

A trajectory is the ordered log of everything the agent did: each tool call, its arguments, the observation it got back, and the reasoning step that chose it. Evaluating an agent means scoring that log. The metrics I track on every run:

Tool-selection accuracy. Of the steps that called a tool, what fraction called the right one? This requires the full tool specification in the trace, not just the tool name. You cannot judge "right tool" without knowing what the alternatives were.
Tool-argument validity. Right tool, wrong arguments is a distinct failure. Score it separately or you will conflate two bugs that have two different fixes.
Step efficiency. Actual steps divided by the steps on the known-good path. A 1.0 is optimal. A 3.0 means the agent wandered, which costs tokens, latency, and money even when the outcome is correct.
Recovery rate. Of the runs where a tool call failed or returned an error, what fraction recovered and still reached the goal? This is the metric that separates a brittle demo from a system you can ship.
Loop and timeout rate. The fraction of runs that never terminated on their own. A high number here is a hard blocker regardless of accuracy.

Industry tooling has converged on this trajectory-first view in 2026. Vendors now ship pre-built templates for tool-call accuracy, trajectory convergence, and step-level analysis rather than only final-answer scoring, per Confident AI's agent evaluation guide. The shift is the whole story: from grading the answer to grading the path.

If you are wiring these metrics into a real harness rather than reading about them, that is the work we do at Devlyn: hire AI engineers who instrument the trajectory before the agent ships, not after an incident.

Outcome vs process: you need both numbers

There are two honest questions to ask of an agent run, and they are not the same question.

Outcome eval asks: did the world end up in the right state? For a retail-support agent that means the order was actually canceled, the refund issued, the database row correct. The cleanest outcome check compares the final system state against an annotated goal state. This is exactly how the τ-bench benchmark scores agents: it inspects the database at the end of a conversation and compares it to the goal, rather than trusting the agent's closing message (Yao et al., 2024). State, not narration.

Process eval asks: was the path sane? An agent can reach the right outcome through a reckless path that happened to work this time. It can also fail the outcome through a perfectly reasonable path that hit a genuine ambiguity. If you only measure outcome, you reward luck and punish honest uncertainty. If you only measure process, you certify agents that look tidy and ship the wrong result.

The trap is real even at full tool-call accuracy. A customer-service agent can hit 100% tool-call correctness and still violate policy on an edge case, because correctness in that domain is contextual and defined by people, not by a metric (Confident AI). Process metrics caught the path; only an outcome-and-policy check catches the violation. Report both, weighted for your blast radius.

Outcome without process rewards luck. Process without outcome rewards tidiness. Ship neither alone.

Building an agent task set

The harness is only as honest as the task set behind it. The discipline mirrors what I argued in evals that predict production: sample from real traffic, freeze it, version it, and never let it grow organically. An agent task set adds one demand a chat eval does not have: each task needs a defined goal state, not just a reference answer.

A single task in the set has four parts:

Initial state. The starting fixture: the seeded database, the available tools, the policy the agent must follow.
The instruction. What the user asks, in real-traffic phrasing, including the messy and compound requests that break agents.
The goal state. The verifiable end condition. Which records should exist, which fields should hold which values, which actions are forbidden.
The known-good path. An optional reference trajectory, used to compute step efficiency and to make trajectory disagreement legible.

Build the set by stratifying real sessions: over-weight the runs where the agent's confidence was low, where a human corrected it, where a tool call failed, and where a past incident occurred. A uniform sample under-represents every hard case, and the hard cases are the ones that cost you money in production.

Then freeze it as a named artifact. The same rule applies as for any eval: your score on a frozen set can only go down, which is the point. You want a fixed ruler, not a rubber band. Cut a new version when you add tasks; never edit the old one.

Consistency: run each task more than once

Agents are stochastic, so a single pass tells you almost nothing about reliability. The metric that matters is whether the agent passes the same task on repeated, independent runs. τ-bench formalized this as pass^k: the probability that an agent succeeds on all of k independent attempts at a task.

The published numbers are sobering and worth internalizing before you trust a demo. On τ-bench, even strong function-calling agents succeed on under half the tasks on a single attempt, and consistency collapses under repetition: pass^8 falls below 25% in the retail domain (τ-bench, 2024). An agent that looks like a coin flip per run looks like a long shot once you demand it behave eight times running. Your customers experience the repeated-run distribution, not the lucky single demo.

So report pass^k for a k that reflects your real volume, not pass^1. If an agent handles a thousand cases a day, its single-run accuracy is the wrong headline. The honest trade-off: measuring consistency multiplies your eval cost by k. Run it anyway on the frozen set. Cheap evals that hide variance are how teams ship coin flips.

A realistic agent eval log

Here is an abbreviated run of a trajectory-aware harness against a frozen agent task set. The numbers are realistic but illustrative, not from a specific live system.

# agent eval run against frozen task set agent-set-2026-w24-v1.jsonl

python -m agent_eval.runner \

--suite agent-set-2026-w24-v1.jsonl \

--agent support-agent-2026-06-15 \

--passk 8

# results summary (n=160 tasks)

outcome pass@1 0.78 # goal-state match, single run

outcome pass^8 0.41 # all 8 runs succeed, the number that ships

tool select acc 0.93

tool arg valid 0.88 # right tool, wrong args is the gap

step efficiency 1.62 # 62% over known-good path, flag

recovery rate 0.71 # recovered from failed calls

loop/timeout 3.1% # threshold 2.0%, FAIL

policy violations 2 # human-gated tail, both escalated

verdict GATE BLOCKED # loop rate + 2 policy hits

Read that log and the design becomes obvious. Outcome pass@1 looks shippable at 0.78. The pass^8 of 0.41 tells the truth: under repeated runs this agent fails the task more often than not. The step-efficiency flag says it works too hard even when it succeeds, which is a latency and cost problem before it is a quality one. And the gate is blocked not on the headline accuracy but on the loop rate and two policy violations the tail review caught. That is the harness doing its job.

Human-gate the tail, not the whole queue

You cannot put a human on every agent run. You also cannot put a human on none of them. The pattern that scales is to gate the tail: automate scoring for the bulk of runs and route a small, deliberately chosen slice to a senior reviewer.

The slice is not random. Send a human the runs where outcome and process disagree, where the agent's own confidence was low, where a policy-sensitive tool was touched, and where the action was hard to reverse. That is the same blast-radius logic from why human-in-the-loop is not a plan by itself: the reviewer is a designed component with a defined trigger and a response budget, not a rubber stamp you bolt on after launch. A human reviewing 3% of runs by deliberate selection beats a human nominally reviewing 100% and actually reading none.

This is where engineering meets revenue. Every run you can safely auto-score is margin; every run you must route to a senior reviewer is cost. A good agent eval harness is, among other things, the instrument that tells you exactly how large that human-gated slice has to be, which is the number that decides whether the agent is profitable to run at all.

Where this connects to the rest of the stack

Agent evals are one specialization of a broader discipline. The frozen-set, blinded-rater, human-calibrated mechanics carry straight over from single-output LLM evaluation; agents just add the trajectory layer on top. And the agents worth evaluating are usually built as agentic workflows with bounded, tool-scoped steps, which is precisely what makes their trajectories legible enough to score in the first place. An agent you cannot trace is an agent you cannot evaluate.

FAQ

What are AI agent evals?

AI agent evals are tests that score an agent's full trajectory rather than only its final answer. They measure tool-call correctness, argument validity, step efficiency, recovery from failed steps, and whether the run reached a verified goal state. They run against a frozen, versioned task set so scores are comparable over time.

How is agent evaluation different from LLM evaluation?

LLM evaluation scores one output: the text a model returns. Agent evaluation scores a sequence of decisions: which tools the agent called, in what order, with what arguments, and whether it recovered when a step failed. Agents have many ways to fail mid-run while still ending on a plausible message, so you score the path and the end state, not just the message.

What metrics matter most in an agent eval framework?

The load-bearing metrics are outcome success against a known goal state, tool-selection and argument accuracy, step efficiency versus a known-good path, recovery rate, and loop or timeout rate. For reliability, report pass^k across repeated runs, not single-run accuracy, because production sees the repeated-run distribution.

How often should I run agent evals?

Run the frozen task set on every model swap, every prompt change, and every new tool you grant the agent, the same regression discipline you would apply to evals that predict production. Then sample live traffic continuously and route the low-confidence tail to review. Offline regression catches what you changed; online sampling catches the distribution shift you did not.

Can I just have a human review agent outputs instead?

Not at production volume. A human can spot-check a single answer, but no one reads a 40-step trace on every run when an agent handles thousands of cases a day. The pattern that works is human-gate-the-tail: automate scoring for most runs and route the low-confidence, high-blast-radius slice to a senior reviewer.

Build the agent harness before you widen its scope

If you are putting an agent into production, build the ai agent evals harness before you extend what the agent is allowed to do. Sample real tasks, define their goal states, freeze the set, score the trajectory, and report consistency under repeated runs. My book Agents That Actually Work covers the containment and tool-scoping side, and A Field Guide to Evals covers the harness mechanics in depth.

If you would rather have a team that builds agents with evaluation wired in from day one, that is the work we do: hire AI engineers who ship agents you can actually trust in production, with the trajectory evals to prove it.

How to Measure (and Reduce) Hallucination

Alpesh Nakrani — Thu, 21 May 2026 18:30:00 GMT

Hallucination is a property of language models to manage, not a bug you patch once and forget. You measure it as faithfulness, or groundedness, against a known source on a frozen evaluation set. Then you reduce it with three interventions that earn their keep: retrieval grounding, constrained decoding, and calibrated abstention. The order is not optional. You cannot reduce what you do not measure, and most teams skip straight to reduction with no ruler in hand.

I have watched a team ship a "fixed" model after a week of prompt-tweaking, with no number to show the fix worked. It didn't. The complaints just moved to a different question class. Measuring hallucination first would have told them that in an afternoon. This piece is the harness I trust for measuring hallucination, and the interventions that move the number versus the ones that are theater. If you want the wider discipline this sits inside, start with my guide to LLM evaluation; this article is the hallucination-specific cut of it.

You cannot reduce what you do not measure. Most hallucination "fixes" are prompt edits with no ruler behind them.

Key takeaways

If you read nothing else, these are the load-bearing claims:

Hallucination is the model guessing fluently when it should abstain. It is mechanical, not mystical, and it is measurable.
Measure faithfulness on a frozen set before you touch the model. Faithfulness is the fraction of claims in an answer that the source supports.
Retrieval grounding, constrained decoding, and calibrated abstention measurably help. Bigger prompts and "be accurate" instructions are theater.
Confident-and-wrong is the expensive failure. A model that abstains costs you a deflection; a model that fabricates costs you a customer.

What hallucination actually is, mechanically

A language model predicts the next token from a distribution. It does not look anything up by default. When the training data thins out around a fact, the distribution stays smooth and confident anyway, and the model emits the most likely-sounding continuation. That fluent guess is a hallucination. There is no internal alarm that fires when the model leaves the region it actually knows.

The mechanism gets worse because of how we score models. OpenAI's 2025 paper argues that hallucination persists largely because evaluation rewards it: binary scoring treats "I don't know" the same as a wrong answer, so a model that guesses outscores one that abstains (Kalai et al., arXiv 2509.04664). Their numbers are stark. An older model abstained on 1% of cases and was wrong on 75%; a newer one trained to abstain hit a 52% abstention rate with 26% errors, at similar accuracy. We trained the guessing in. That is the thing we now have to measure out.

How to measure hallucination: faithfulness and groundedness

Measuring hallucination means scoring an answer against a source it is supposed to be true to. The 2026 evaluation stack splits this into three checks, and they are not interchangeable (Braintrust, 2026):

Groundedness checks the answer against the specific retrieved passages you put in the context.
Faithfulness checks the answer against the full source text, catching claims that drift during summarizing or rewriting.
Factuality checks a claim against general world knowledge, with no provided source.

For any retrieval-grounded system, faithfulness is the metric I gate on. RAGAS defines it concretely: break the answer into atomic claims, then divide the number of claims the context supports by the total claims in the answer (Ragas docs). A faithfulness of 0.78 means roughly a fifth of what the model asserted was not in the source it was handed. That is your hallucination rate, expressed as a number you can track.

The hard rule: measure faithfulness on a frozen, production-sampled set, not a set you keep editing. I make the full case for freezing the ruler in my essay on evals that predict production. A frozen set can only score lower over time, which is exactly what makes it honest about hallucination. If your hallucination number only ever improves, you are measuring your willingness to edit the test.

Detection methods that work in production

You cannot have a human read every answer for fabrication. That does not scale, and I argue why in why a human in the loop is not a plan. Detection has to be automatic and continuous. Three methods carry the load in 2026.

Claim extraction plus verification. This is the engine under faithfulness scoring. A judge model breaks the answer into atomic claims, then verifies each one against the retrieved context, and you report the supported fraction. It is the most direct signal because it tells you which claim hallucinated, not just that something did. The cost is real: it is several extra model calls per answer, so you sample in production rather than scoring every request.

LLM-as-a-judge against a rubric. A strong model grades the answer for groundedness on a fixed scale. It is cheaper than full claim extraction and good for trend monitoring. The catch is that a weak judge under-detects contradiction, so faithfulness scoring is only reliable with a strong reference model behind it. I cover when to trust the grader in when to trust LLM-as-a-judge.

Self-consistency sampling. Generate an answer several times at nonzero temperature and measure agreement. When a model knows a fact, the samples converge; when it is hallucinating, they scatter. Disagreement across samples is a cheap, model-internal signal of low confidence, and it needs no source document to compute.

Here is what a faithfulness run looks like coming out of an eval runner, scored against a frozen set. The numbers are realistic, not from a specific live system.

# faithfulness run, frozen set halluc-set-2026-w24-v1.jsonl

python -m eval.faithfulness \

--suite halluc-set-2026-w24-v1.jsonl \

--judge strong-ref-2026-06 \

--mode claim-extraction

# summary, 500 production-sampled cases

faithfulness 0.88 # supported claims / total claims

unsupported claims 61 # across 42 of 500 answers

abstention rate 7.4% # model declined to answer

confident-wrong 3.1% # asserted, unsupported, no hedge

verdict FLAG # confident-wrong over 2.0% budget

The line that matters is the last data row. Confident-and-wrong, the cases where the model asserted something the source did not support and gave no hedge, is the failure that costs money. A 0.88 faithfulness looks fine in a deck. A 3.1% confident-wrong rate is the number that should hold the deploy.

Interventions that measurably help vs. theater

Once you have a faithfulness number, you can tell a real intervention from a comfortable one. The field shifted in 2026 from chasing a truthful model to building a truthful system: retrieval, validation, calibration, and structural constraints around a base model that will always guess if you let it (Lakera, 2026). Three interventions move the number.

Retrieval grounding. Give the model the source instead of its memory. Retrieve from a verified knowledge base, rerank with a cross-encoder, and set a similarity threshold that filters weak chunks before they reach the prompt. Grounding works because it converts a factuality problem, which the model fails, into a faithfulness problem, which you can score and gate. The trade-off is that bad retrieval grounds the model in the wrong passage, so retrieval quality becomes your new hallucination surface. I cover that decay in why RAG pipelines fail in month three.

Constrained decoding. Force the output into a schema with a JSON grammar, and require a citations array and a confidence field for every claim. This kills structural hallucination outright: the model cannot invent a field or cite a document that is not in the allowed set. It is the cheapest high-leverage fix because it is a decoding constraint, not a model change.

Calibrated abstention. Let the model say "I don't know" and reward it for doing so. Set a confidence threshold below which the system abstains or escalates, and measure abstention as a first-class outcome, not a failure. This is the direct answer to the incentive problem: if your eval gives credit for a well-placed refusal, you stop training the guessing back in (Kalai et al.).

The theater list is shorter and louder. Adding "be accurate and do not make things up" to the system prompt does not move faithfulness; the model had no mechanism to comply. Stacking few-shot examples of good answers does not teach the model where its knowledge ends. Raising the model size buys you fluency, which makes hallucinations harder to spot, not rarer. None of these survive contact with a frozen faithfulness set, which is precisely why you measure first.

Intervention	What it fixes	Helps or theater
Retrieval grounding + reranking	Model lacks the fact	Helps
Constrained decoding (schema + citations)	Invented fields and sources	Helps
Calibrated abstention	Guessing when unsure	Helps
"Be accurate" prompt instructions	Nothing measurable	Theater
More few-shot examples	Style, not knowledge boundary	Theater
Bigger model alone	Hides hallucination behind fluency	Theater

Calibrated abstention is the whole game

A model that abstains when it does not know is worth more than a model that is slightly more accurate and never hedges. Abstention turns an unbounded risk into a bounded cost. The system says "I can't answer that from the sources I have" and routes to a human or a fallback. That is a deflection you can price. A fabricated answer delivered with confidence is a liability you cannot.

Calibration means the model's stated confidence matches its real accuracy. When it says 90% sure, it should be right about 90% of the time. You measure this by bucketing answers by stated confidence and checking accuracy within each bucket. A well-calibrated model lets you set one threshold and trust it. An overconfident model makes every threshold a gamble, which is the state most untuned models ship in.

Confident-and-wrong is the expensive failure. Abstention costs you a deflection; fabrication costs you a customer.

Here is the revenue tie, in one line a CRO understands. A support agent that abstains on 8% of tickets sends them to a human at a known cost per ticket. A support agent that confidently fabricates a refund policy on 3% of tickets creates a chargeback, a complaint, and a churned account. The first is an operating expense. The second is an uncapped loss with your brand attached. Measuring and budgeting the confident-wrong rate is the difference between an AI feature you can underwrite and one you are quietly gambling on.

This is also where measuring hallucination stops being an eval exercise and becomes a continuous-monitoring problem. Faithfulness and confident-wrong rate drift as your corpus and traffic change, so they belong on a dashboard, not in a one-time spreadsheet. That is squarely an AI observability and monitoring job, scored on live traffic against a frozen reference.

Frequently asked questions

How do you measure hallucination in an LLM? Score the answer against its source on a frozen evaluation set. For retrieval systems, measure faithfulness: the fraction of claims in the answer that the provided context supports. Break the answer into atomic claims, verify each against the source, and report the supported ratio as your hallucination rate.

What is the difference between faithfulness and factuality? Faithfulness checks an answer against a specific source you provided, so it asks "is this true to the document." Factuality checks a claim against general world knowledge with no source, so it asks "is this true at all." For RAG and grounded systems you gate on faithfulness, because it is the property you actually control.

Can you eliminate LLM hallucination completely? No. Hallucination is a property of how language models generate, not a bug with a final patch. You manage it down with retrieval grounding, constrained decoding, and calibrated abstention, and you keep measuring it because the rate drifts as your data and traffic change.

What is the most expensive hallucination failure? Confident-and-wrong: the model asserts something false with no hedge. Abstention is cheap because it routes to a fallback at a known cost. A confident fabrication is an uncapped loss, because it ships to a customer as if it were verified.

If you want the full harness this plugs into, including frozen sets, judge calibration, and abstention budgets, my book A Field Guide to Evals walks through it end to end, and Observability for AI covers monitoring hallucination on live traffic. For why a frozen ruler is the whole point, read my essay on evals that predict production. And if you would rather have a team instrument faithfulness and confident-wrong rate in your stack from day one, that is exactly what Devlyn's AI observability and monitoring work is for. Measure it first. Then reduce what the number tells you to.

An honest accounting of what agents can do today

Alpesh Nakrani — Wed, 20 May 2026 18:30:00 GMT

There is a version of this essay I could write that would make everyone feel good. I could tell you that autonomous agents are about to transform every knowledge-work function, that your competitors are deploying fleets of them right now, that the window to act is closing. People write that essay every week on LinkedIn. I am going to write the other one.

I have spent the last two years deploying production AI systems, not in controlled demos, not in side projects, but in real operational contexts where failures land on real people. At Devlyn we run AI-assisted workflows across customer service, operations, and product. Before that, as CTO and COO of earlier companies, I was responsible for the systems that kept things running. What I know about agents comes from the full arc: the promising prototype, the embarrassing incident, the slow grind of making something reliable enough to trust with real work.

The honest truth about agents in mid-2026 is that they work in a narrow band. That band is real and valuable, I am not writing a pessimist's screed. But it is narrower than the demos suggest, and understanding its contours is the difference between deploying something useful and deploying something that silently fails in ways you will not catch until the damage is done.

What "agent" actually means in production

The word agent has become almost meaningless from overuse. Let me define what I mean. An agent is a system where a language model is given a goal, a set of tools, and the ability to decide what to do next based on what it sees. It is not a single prompt-response pair. It loops. It calls tools, observes results, updates its plan, calls more tools. It can run for minutes or hours. It might send emails, write files, call APIs, escalate to humans.

That multi-step autonomy is both the point and the problem. The moment you let a model decide its own next action, you have introduced a control surface that behaves differently from anything you have shipped before. Traditional software fails in ways that are mostly predictable, a missing field, an unhandled exception, a timeout. Agents fail in ways that are surprising. They confidently pursue the wrong goal. They get stuck in loops. They hallucinate tool outputs and proceed as if the hallucination were real. They take an action that cannot be undone and then, when they realize it was wrong, report success anyway because they have no mechanism to surface the error honestly.

I am not saying this to scare you off agents. I am saying it because every deployment decision downstream of this depends on understanding the failure mode. The narrow band of tasks where agents genuinely earn their keep is defined almost entirely by how well you can contain and recover from that failure mode.

The narrow band: what actually works

The tasks where I have seen agents deliver consistent, trustworthy value share four characteristics. They are bounded, reversible, verifiable, and tool-scoped.

Bounded means the task has a clear starting state and a clear ending state. Not "handle our support queue", that is unbounded and perpetual. But "triage this batch of 200 support tickets and draft a prioritized response queue, flagging anything that matches our refund policy for human review", that is bounded. The agent runs, produces output, and stops. You look at the output. You decide what to do with it.

Reversible means that if the agent made a mistake, you can undo it without catastrophic downstream consequences. Drafting is reversible. Sending is not. Updating a record in a staging environment is reversible. Posting to a production database that triggers billing is not. I have a rule of thumb that I apply ruthlessly: if an agent action cannot be undone in under five minutes by a senior engineer without external dependencies, it is not a candidate for autonomous execution. It either needs explicit human approval, or it should not be in the agent's tool scope at all.

Verifiable means you can check the output mechanically, not just by reading it and thinking it seems fine. For a code generation agent, the test suite runs and passes or fails, that is verification. For a data extraction agent, the output schema is validated and the records cross-checked against known totals. For a document summarization agent, I can do spot checks and compare against human-generated summaries. "Seems good" is not verification. Vibes are not evals. If you cannot write a check, you are flying blind.

Tool-scoped means the agent has access to exactly the tools it needs and nothing more. This is the principle of least privilege applied to AI systems. An agent that summarizes documents should not have write access to the document store. An agent that drafts email responses should not have a send key. An agent that queries a database for analytics should be connected to a read replica, not the primary. This is not just about security, though it is also about security. It is about limiting blast radius. When the agent does something unexpected, and it will, you want the set of possible consequences to be small and reversible. My book Agents That Actually Work: The narrow band where autonomy earns its keep goes deep on this framework, with production examples and the failure modes that shaped each principle.

The tasks where agents earn their keep are bounded, reversible, verifiable, and tool-scoped. Remove any one of those four properties and you are no longer in the narrow band. You are in the territory where agents fail quietly and you find out later.

Concrete examples from my own deployments. Data extraction from unstructured documents, PDFs, emails, scanned invoices, works well when you validate output against a schema and have a fallback path for low-confidence extractions. First-pass triage of support tickets against a known taxonomy works well when humans review the triage output before it triggers any downstream action. Generating draft responses to common inquiry patterns works well when a human reads and edits before sending. Automated monitoring that surfaces anomalies and writes a structured incident report works well when it is advisory, not responsive, it tells you something might be wrong, it does not try to fix it.

What does not work well in my experience: long-horizon research tasks where the agent must maintain a coherent goal across many steps and many hours. Planning tasks where the goal itself is ambiguous and the agent must clarify requirements before proceeding. Anything where the agent needs to negotiate with external parties, customers, vendors, partners, because the relational and reputational stakes are too high and the agent has no real sense of them. And anything involving irreversible financial or legal actions, full stop, no exceptions.

Memory: the part everyone gets wrong

The most underrated problem in agent deployment is memory. When a language model handles a single conversation, memory is not an issue, the entire conversation is in context. When an agent runs across multiple sessions, or when you are operating multiple agents that need to share state, or when an agent needs to remember something from a run three days ago to make a good decision today, you need an explicit memory architecture. And most teams treat this as an afterthought.

I have seen agents fail badly because of memory design errors. An agent that is supposed to escalate issues that have been open for more than 48 hours does not escalate them because it has no reliable way to know when an issue was first opened, the timestamp is stored in a format it interprets inconsistently. An agent that is supposed to avoid sending a follow-up message to a customer who already received one today cannot check that reliably because the sent-message log is in a different system and the agent's tool for querying it does not handle timezone boundaries correctly. These are not exotic edge cases. They are the kinds of things that fail in the first week of production use.

Good memory architecture for agents involves three things: a clear distinction between working memory (in-context, ephemeral), episodic memory (structured logs of what happened in past runs), and semantic memory (facts and preferences that should persist). Working memory is handled by prompt design. Episodic and semantic memory require deliberate infrastructure choices, where the data lives, how it is indexed, how the agent retrieves it, and how stale entries are managed. Memory Systems for Agents is the most rigorous treatment of this I have found, it changed how I think about agent persistence and is required reading for anyone building multi-session workflows.

The practical implication: before you deploy any agent that needs to reason about its own history, audit every place it needs to recall something. Ask: where does that information live? How does the agent access it? What happens if the access fails? What happens if the information is stale? If you cannot answer those questions cleanly, you are not ready to ship.

Observability: knowing why the agent did that

Here is a question I ask every team that tells me they are deploying agents in production: when the agent does something unexpected, how long does it take you to figure out why?

For most teams, the honest answer is "too long." They have logs of what the agent did, tool calls, outputs, final results. They may not have a clean trace of the reasoning that led from input to decision. They cannot answer "the agent sent the wrong escalation, what did it see that made it think escalation was appropriate?" without significant forensic work.

This matters more than most people think, and not just for debugging incidents after the fact. If you cannot inspect an agent's reasoning process during a run, you cannot intervene intelligently. You are reduced to watching the outputs and hoping they are acceptable, which is not a posture you can maintain in a production system for anything consequential.

What I consider the minimum viable observability stack for an agent in production: structured traces of every reasoning step and tool call, with inputs and outputs; latency and cost attribution by step; a mechanism to tag runs as "interesting" for later review; dashboards that surface behavioral drift over time, not just whether the agent succeeded, but whether its approach is changing in ways that might indicate model updates or data distribution shifts upstream; and human review queues that sample a percentage of runs for spot inspection regardless of outcome. The discipline I have found most useful here treats observability for AI systems as a first-class concern, covering tracing, evaluation loops, and the org practices that make observability useful rather than theatrical.

If you cannot answer "why did the agent do that?" within ten minutes of being asked, you do not have production-grade observability. You have logging. Those are different things.

We do post-mortems on significant agent incidents the same way we do post-mortems on infrastructure incidents. Not blame-focused, but causally rigorous. What did the agent see? What decision did it make? Was the decision reasonable given what it saw, or is there a reasoning error we need to address? Is the issue in the model, in the tool, in the prompt, in the memory system, or in our evaluation rubric? You cannot run this process without observability infrastructure. You end up guessing, and guessing is how you end up with the same incident again next month.

Human escalation as first-class design

There is a version of agent design that treats human escalation as a failure mode, something that happens when the agent cannot handle a case, a regrettable exception. This is wrong. Human escalation is a feature, and in my experience the teams that build it in as a first-class design element get to reliable production systems faster than teams that treat it as an edge case.

At Devlyn, we have a principle we call "senior owns production." It means that regardless of how much autonomy we extend to any automated system, a senior person is always in the loop for anything that could affect a customer relationship or a material business outcome. The agent is not the decision-maker for those cases. The agent is the triage layer, it handles what it can handle well, and it routes everything else to a human with enough context for that human to act quickly.

This design requires two things that teams often skip. First, the agent needs to know when to escalate, not just when it encounters an error, but when it encounters a situation where its confidence in its own reasoning falls below a threshold. This requires the agent to have calibrated self-assessment, which is a real capability that you have to build and test explicitly. You cannot assume it. Second, the escalation path needs to actually work. There has to be a human available to receive the escalation, a mechanism to get them the right context, and a response time expectation that the agent can plan around. If the escalation queue backs up, the agent is stuck or makes a decision it should not have made. The human handoff is part of the system, and it needs to be designed and monitored like any other part of the system.

I have zero tolerance for "we will figure out human escalation later." Later means after something goes wrong. That is a choice, and it is a bad one.

No fantasy timelines

I want to close with something that feels almost embarrassingly basic, and yet I keep seeing it violated in practice. Do not make commitments about agent capabilities based on demo performance. Demos are curated. Production is not curated. A demo runs a task in the most favorable conditions the presenter can engineer. Production runs the task when the input is slightly malformed, when the external API is slow, when the context has grown large enough to degrade reasoning quality, when the model update last week changed a behavior you depended on without anyone announcing it.

The gap between demo performance and production reliability for agents is the largest I have seen for any category of software in twenty years. It is not because the technology is fraudulent. It is because agents are genuinely sensitive to distribution shifts in a way that traditional software is not, and because the evaluation infrastructure needed to catch regressions before they reach production is still immature and under-invested in most organizations.

The teams I respect most are the ones who build boring-looking agents that do narrow things well, invest heavily in evaluation and observability, extend the scope of autonomy only when the evals support it, and refuse to pretend otherwise. They are not the teams with the best demo reel. They are the teams with the best track record of things not breaking in front of customers.

The narrow band is real. Tasks that are bounded, reversible, verifiable, and tool-scoped, these are places where agents earn their keep. Good memory architecture keeps agents coherent across sessions. Good observability tells you why the agent did what it did. Least-privilege tool scopes limit blast radius when things go wrong. Human escalation handles the edges that no system should handle autonomously. These are the foundations. Everything else is detail.

Start there. Build the evaluation infrastructure before you extend the scope. Do not make promises about timelines based on demos. Keep senior people in the loop on anything that matters. And be honest, with your team, your stakeholders, and yourself, about where you actually are in the journey.

That honesty is not pessimism. It is the prerequisite for building something real.

Frequently asked questions

What can AI agents actually do reliably today? They earn their keep on tasks that are bounded, reversible, verifiable, and tool-scoped: data extraction from unstructured documents, first-pass triage against a known taxonomy, draft generation a human reviews before sending, and advisory monitoring that surfaces anomalies. Remove any one of those four properties and you have left the narrow band.

What can agents not do well yet? Long-horizon research that demands a coherent goal across many hours, planning where the goal itself is ambiguous, negotiating with customers or vendors where relational stakes are high, and anything involving irreversible financial or legal actions. Those are not edge cases to engineer around; they are the boundary of the narrow band.

What does it take to deploy an agent safely in production? Evaluation infrastructure before scope, explicit memory architecture, observability that answers "why did the agent do that?" in minutes rather than hours, least-privilege tool scopes, and human escalation designed as a first-class feature rather than a failure mode. If you want help building that foundation, that is the work my team does at Devlyn.

The spec is the program now

Alpesh Nakrani — Tue, 19 May 2026 18:30:00 GMT

There is a line I keep coming back to, one I first started sketching in The Spec Is the Program: Intent as the new source code, and it goes like this: when the machine writes the implementation, the thing you wrote before the machine runs becomes the thing that matters most. Not just a planning artifact. Not a ticket. The actual source of truth you version, review, and defend.

I have been operating inside that reality at Devlyn for long enough now that I want to say something concrete about it, not in the abstract sense of "AI is changing software," but in the very specific sense of what my senior engineers do differently on a Monday morning than they did three years ago. What they argue about. What they reject. What they own.

The answer, compressed: they own the spec. And the spec is the program now.

Key takeaways

The spec is the artifact you version and defend. When the model writes the implementation, the specification becomes the source of truth you review, version in git, and read first in a postmortem. The code is a projection of it.
You author constraint, not code. A skilled engineer writes structured intent, the model produces a diff, and the engineer evaluates that diff against the spec, then tightens and regenerates. Writing precise constraint is harder than writing the implementation directly.
Specs for model implementers must be explicit. The model fills any gap you leave with something plausible but not necessarily correct, so a good spec states intent, happy path, edge cases, invariants, and what NOT to build.
Test invariants, not mechanism. Behavior-coupled tests stay green across model-generated rewrites and actually mean something when they fail; mechanism-coupled tests break every regeneration.
Spec drift is the new technical debt. When code quietly becomes the implicit new spec, divergence compounds one model-run at a time. The defense is to treat the spec as primary and assume it is right when code disagrees.

What "the machine writes it" actually means in practice

Let me be precise about the claim, because it gets sloppy fast. When I say the machine writes the implementation, I do not mean a junior engineer pastes a vague prompt into a chat window and ships whatever comes out. That is not what is happening in our senior-engineer layer and it is not what I am describing.

What I mean is this: a skilled engineer writes a structured artifact, call it a spec, a design, a declaration of intent, and then a code-generating model produces a diff from that artifact. The engineer does not write the for-loops. The engineer does not write the SQL. The engineer writes the thing that governs the behavior, and the model fills in the mechanism. Then the engineer reads the diff, evaluates it against the spec, and either accepts it or tightens the spec and runs again.

That loop, write intent, generate mechanism, evaluate against intent, tighten, is the new cycle. And the thing that has changed is what the engineer is actually authoring. They are not authoring code anymore. They are authoring constraint.

This sounds cleaner than it is. Writing precise constraint is hard. It is, in many ways, harder than writing the implementation directly, because when you write the implementation you can cut corners in your own head. When you write the spec, you have to externalize those corners. The model will find every gap you leave.

The spec as executable artifact

Here is the part that changes how you think about software development. In the old model, the spec was upstream of the code and the code was upstream of the tests. The spec lived in a document somewhere, maybe a Confluence page, maybe a JIRA epic. The code was the thing that ran. The spec was aspirational at best, archaeological at worst, something you read to understand what someone intended six months ago before the implementation drifted.

In the new model, the spec is the thing that runs, not directly, but in the sense that the model will faithfully implement whatever the spec says, including the parts you did not think carefully about. The spec becomes load-bearing. It is no longer aspirational. It is operative.

That is a genuinely different relationship with documentation. When a document can cause code to exist, the document is no longer soft. It has to be precise about behavior, not just intent. It has to specify not just what the system does in the happy path, but what it does at the edges. It has to encode the invariants, the things that must always be true, not just the features.

When a document can cause code to exist, the document is no longer soft. It has to be precise about behavior, not just intent.

At Devlyn, we have started treating specs with the same rigor we used to reserve for schemas and contracts. We review them. We version them in git alongside the code they generate. We have spec review as a step in our engineering process, not as a precursor to the real work but as the real work. When someone proposes a change to a system, they submit a spec change first. The code change is downstream of that, and often it is generated.

What a good spec looks like when the model is the implementer

This is where it gets concrete enough to be useful. A spec that is written for human implementers is different from a spec that is written for model implementers, and not in the direction most people expect.

A spec for a human implementer can leave a lot implicit. The human brings contextual judgment. They know the codebase. They will fill in reasonable defaults. They will notice when something seems off and ask. The spec can be high-level and still produce a good implementation because the human bridges the gap between intent and mechanism.

A spec for a model implementer needs to be explicit about things that human implementers would infer. Not because the model is dumb, it is not, but because the model will produce a plausible implementation for any gap you leave, and "plausible" and "correct" are not the same thing. The model does not push back. It fills in. That filling-in is where bugs are born in AI-assisted development, not in the code the model writes for things you specified, but in the code the model writes for things you left unspecified.

So a good spec, in our current practice, has four layers:

# Spec: Order eligibility check ## Intent Determine whether a customer order can proceed to fulfillment given current inventory state and any active holds. ## Behavior (happy path) Given a valid order and no holds, return `eligible: true` with a fulfillment window based on warehouse proximity. ## Edge cases / constraints - If any line item is out of stock: return `eligible: false`, reason: `inventory`, include which SKUs are blocked. - If account has active fraud hold: return `eligible: false`, reason: `hold`, do NOT expose hold details to client response. - Partial availability (some items in stock): eligible: false unless customer has opted into partial fulfillment. - Timeout from inventory service: fail open toward ineligible, log for async review, never fail silently. ## Invariants (must always be true) - Fraud hold reason must never appear in external API response. - Response time must not depend on number of line items (O(1) calls). - All eligibility decisions must be logged with order ID + reason. ## What NOT to build - Do not add caching at this layer; caching lives upstream. - Do not call pricing service; eligibility is inventory + holds only.

That last section, "what not to build", turns out to be one of the most important parts of any spec written for a model implementer. The model has strong priors about what a system like this "usually" includes. It will add things. You have to explicitly exclude the things you do not want, not just specify the things you do.

Accepting machine-authored diffs is a new engineering skill

There is a skill that nobody trained for that is now table stakes for senior engineers working in this mode, and it is the skill of reading a machine-authored diff with the right mental posture.

The wrong posture is to read the diff like you would read code you wrote yourself, looking for syntax errors and obvious bugs. The model does not produce syntax errors. It produces plausible code that looks right. The errors are at a higher level of abstraction, a behavior that is slightly off, an invariant that is almost but not quite maintained, a happy-path implementation that does not handle the one edge case that the spec mentioned but did not fully specify.

The right posture is to read the diff against the spec, clause by clause. Not "does this code look correct" but "does this code implement what the spec says." Those are different questions. The first is a code review. The second is a spec audit. We have shifted almost entirely to the second.

This is also why I believe strongly in properties-based thinking, what I sometimes call invariant-first specification. If your spec encodes the invariants, the things that must always be true, you can write tests that check the invariants, not the code paths. And those tests remain valid even after the model regenerates the implementation, because you are testing behavior, not mechanism. You are testing the thing that is supposed to be stable across implementations.

You are testing behavior, not mechanism. The thing that is supposed to be stable across implementations, not the implementation itself.

This matters more than it seems. In the old model, tests were tightly coupled to implementation. When you refactored, tests broke. Not because behavior changed but because the mechanism changed and the tests were testing mechanism. In the model-generated world, the mechanism regenerates constantly. If your tests are coupled to mechanism, you will be rewriting tests constantly. If your tests are coupled to behavior, to the invariants in the spec, they stay green across model-generated rewrites, and they actually tell you something when they fail.

Spec drift is the new technical debt

In traditional software development, the great enemy of long-term maintainability is code entropy: the codebase drifts from the original design, abstractions collapse, complexity accumulates. You accrue technical debt.

In the model-generated world, there is a new kind of entropy that I think is actually more dangerous, and I call it spec drift. It works like this: the spec says one thing, the generated code does something slightly different because of a gap in the spec, and then the gap quietly becomes the implicit new spec. People start writing the next spec against what the code actually does rather than what the original spec said. Over time, the spec and the code diverge in a different direction, the code diverges from the spec, and then the spec gets rewritten to match the code rather than the original intent, and then the next generation of code is generated from a spec that already incorporated one round of drift.

This compounds. Each generation of code incorporates the drift from all previous generations. Within a few months, you have a system that does something recognizable but not exactly what anyone intended at any point, a sort of evolutionary drift away from the original design, one model-run at a time.

The defense against spec drift is to treat the spec as the primary artifact and the code as a projection of it. When there is a discrepancy between spec and code, the default assumption should be that the spec is right and the code is wrong, not the other way around. When you change the spec, you should change it deliberately, with review, and you should record why. The spec is the commit message that explains the code, but more than a commit message: it is the document that future generations of the model will use to regenerate the code, so it has to stay accurate and intentional.

At Devlyn, we have started doing spec audits, periodic reviews where we read the spec for a system against what the system actually does and flag divergences. It takes time. It is worth it. It is how we keep the model-generated codebase aligned with the decisions that senior engineers made.

What senior engineers own now

I want to be direct about what this means for engineering roles, because there is a lot of hand-waving about whether AI replaces engineers and I find it mostly useless. The real question is: what does the senior engineer own in the model-generated world?

The answer, in our practice: they own the architecture and the spec. They own production readiness, the judgment calls about failure modes, observability, security posture, and operational behavior that cannot be derived from features alone. They own the invariants. They own the spec review process. They own the decision about when the model's output is good enough to ship and when the spec needs to be tightened.

What they do not own, or own only lightly: the line-by-line implementation. The first-draft code. The boilerplate. The translation from a well-specified behavior into a working function. The model owns that, under supervision.

This is not a diminishment of the senior engineer role. It is, if anything, a sharpening. The model takes away the implementation work that junior engineers used to do as practice and senior engineers used to do as tax. What is left is the work that required senior judgment all along, the design, the constraint, the production posture, except now that work is more visible and more load-bearing because it directly governs what gets built.

I think this is a better deal for senior engineers who embrace it. The frustrating parts of senior engineering, translating a clear design intent into hundreds of lines of boilerplate that mostly express the intent without adding any value, those parts compress. The rewarding parts, making hard architectural calls, writing down the invariants you care about, holding the line on production readiness, those parts expand.

The catch is that you have to actually write the spec. You cannot hold the intent in your head and expect the model to read it. You have to externalize the judgment. That is the discipline. And it is a real discipline. Writing a spec good enough to govern a model-generated implementation requires you to surface assumptions you would otherwise leave implicit, specify behaviors you would otherwise leave to the implementer's judgment, and enumerate the edge cases you would otherwise trust a senior engineer to catch.

The more I work in this mode, the more I think good spec-writing is actually a more rigorous form of engineering than good code-writing. Code lets you hide in the mechanism. A spec has to say what you mean.

Software development when the machine writes it

Step back from the individual engineer and look at what software development looks like when model-generation is the primary path from spec to code. It looks, in structural terms, like this:

The work of a software project is front-loaded into intent. The early phases of a project, what used to be architecture and design phases, often abbreviated or skipped in agile shops, become the primary creative work. The spec for each component is the thing that gets iterated on, reviewed, and refined. The code is generated downstream and evaluated against the spec.

Changes to a system start with the spec, not the code. You do not open the implementation file and start editing. You open the spec and change what you want the system to do. Then you generate a new implementation from the new spec. This is not always clean, there are cases where the generated implementation has been hand-modified and you have to reconcile, but the general direction of change should be spec-first.

Reviews are behavior reviews, not code reviews. The question is not "is this code well-structured" but "does this code implement the spec." The model can produce well-structured code that does not implement the spec. It can produce ugly code that does. You care about the spec compliance.

Testing is invariant-testing. Your test suite encodes the properties that must always hold. When the implementation is regenerated, the tests run against the new implementation. If they pass, the invariants hold. You ship. The tests do not care how the invariants are achieved; they care that they are achieved.

All of this is described at much greater length in The AI-Native SDLC, where I go through the full development lifecycle in the model-generated world, from initial design through incident response. But the core move is the one I have been describing here: shift what you author from mechanism to intent, and treat the intent artifact as the primary thing you version and defend.

The artifact you actually defend

There is a question I ask in engineering reviews at Devlyn: "If this system does the wrong thing in production, what document would you read first to understand why?" Three years ago, the answer was always: the code. Now the answer is: the spec. Because the spec is the thing that governed the behavior. The code is the thing that expressed it.

That shift in what you read first is, I think, the deepest change. It means the spec is the artifact you defend in a postmortem. It means the spec is the artifact you update when requirements change. It means the spec is the thing a new engineer reads to understand what the system is supposed to do, not just what it does.

We are still early in this transition. The tooling for managing specs, versioning them alongside code, tracking spec drift, and making spec review a first-class part of the engineering process, all of that is still being built, including by us. The practices are ahead of the tools. That is fine. It has always been fine. The practices are right.

The machine writes the implementation. The engineer writes the spec. The spec is the program now.

Frequently asked questions

What does "the spec is the program" mean?

It means that when a code-generating model writes the implementation, the specification, the structured artifact of intent you wrote before the machine ran, becomes the thing you actually version, review, and defend. The spec is no longer aspirational documentation that drifts from the code. It is operative, because the model faithfully implements whatever it says, including the parts you did not think carefully about. The code becomes a projection of the spec rather than the source of truth.

How is a spec for a model implementer different from one for a human?

A spec for a human can leave a lot implicit, because the human brings contextual judgment, knows the codebase, and asks when something seems off. A model does not push back; it fills any gap you leave with a plausible implementation, and plausible is not the same as correct. So a spec written for a model implementer has to be explicit about the things a human would infer: intent, the happy path, edge cases and constraints, the invariants that must always hold, and an explicit list of what NOT to build.

What is spec drift?

Spec drift is the new technical debt. The spec says one thing, the generated code does something slightly different because of a gap, and that gap quietly becomes the implicit new spec. The next spec gets written against what the code does rather than what was intended, and each generation of code incorporates the drift from all previous generations. The defense is to treat the spec as the primary artifact: when spec and code disagree, assume the spec is right and the code is wrong, and change the spec deliberately, with review.

Eval-Driven Development: The Test Suite Leads

Alpesh Nakrani — Mon, 18 May 2026 18:30:00 GMT

Eval-driven development is TDD for probabilistic systems. You write the eval before the prompt or model change, let a frozen eval set gate every deploy, and treat the eval suite as the source of truth the model is optimized against. It is the discipline that replaces test-driven development when the unit under test is a distribution, not a deterministic function.

I started taking this seriously after watching a prompt change ship on a teammate's eyeball check. The output looked better on the three examples he tried. In production it was worse on a class of inputs none of us had thought to type. There was no test to catch it, because the old habit said tests are for code with one right answer. An LLM call does not have one right answer. It has a distribution of answers, and you can only reason about it by sampling and measuring. That is what eval-driven development is for.

TDD asserts one correct output. Eval-driven development measures a distribution of outputs against a threshold you set first.

Key takeaways

Eval-driven development is TDD for probabilistic systems: write the eval before the prompt, gate every deploy on a frozen eval set, and treat that set as the spec the model is optimized against.
The unit of testing changes from a binary assert on one output to an aggregate score on a sample of cases versus a threshold you set before the run.
One run is not a measurement. At 100 cases and 80% accuracy a 95% confidence interval is roughly plus or minus 8 points, so green today can be red tomorrow with no code change.
Keep the gate cheap by running it as a pyramid: deterministic checks on every commit, a classifier-backed sweep on the regression set, and an expensive LLM judge on a sampled subset.
The eval set is a revenue asset, not hygiene. It is the only thing that lets you ship a model upgrade and prove, with evidence, that this week is not worse than last quarter.

How eval-driven development differs from classic TDD

Classic TDD works because a deterministic function has a single correct output for a given input. You assert add(2, 2) == 4, the test is binary, and a passing test today passes forever unless the code changes. That contract is the whole reason TDD is trustworthy. None of it survives contact with a language model.

An LLM call is stochastic. The same input can yield different outputs across runs, models, temperatures, and prompt edits. There is no single string to assert against; there are thousands of acceptable ones and a long tail of bad ones. A binary assertion on one example proves nothing about the system, because one sample tells you almost nothing about a distribution. This is the gap eval-driven development exists to close (Braintrust, eval-driven development).

So the unit of testing changes. Instead of one input and one expected output, an eval scores many inputs against a rubric and reports an aggregate: accuracy, faithfulness, a calibrated judge score, whatever maps to real failure. Instead of pass/fail on a string, you set a threshold before the run and the gate compares the aggregate to it. A single passing case is an anecdote. A score on a frozen set of 300 cases is a measurement.

The table below is the translation I keep in my head when moving an engineer from TDD to evals.

Classic TDD	Eval-driven development
One input, one correct output	Many inputs, a distribution of acceptable outputs
Binary assert (==)	Aggregate score vs. a threshold set first
Deterministic; flake is a bug	Stochastic; variance is expected and measured
One run is enough	Sample size and confidence interval matter
Green forever unless code changes	Decays as the world and the model drift

The workflow: eval first, then the prompt

The order is the whole point. In eval-driven development the eval comes first, before the prompt, before the pipeline, before you pick a model. You write down what good looks like as a scored test, then you build the thing that passes it. Same loop as TDD, different unit. The manifesto version of this is blunt: if your evals do not run on every change, they do not exist (evaldriven.org).

Here is the loop I run.

Write the eval first. Define the failure mode in a rubric and a metric before you touch the prompt. "This summary must not invent a number that is not in the source" is a testable claim. "Make it better" is not.
Freeze a golden set. Pull real inputs from production, version the set as an artifact, and never grow it casually. The set is your ruler; a ruler you keep editing measures nothing.
Set the threshold before the run. Every threshold needs a justification you wrote down first. Picking the bar after you see the score is rationalization, not evaluation.
Make the change, then run the suite. Edit the prompt or swap the model, run the frozen set, and read the aggregate against the gate.
Gate the deploy in CI. The eval runs next to lint and type-check. A change that drops the score past the threshold is blocked from merge, automatically.

That last step is where eval-driven development stops being a notebook habit and becomes engineering. OpenAI's own regression workflow stores completions per prompt version and compares runs so a prompt edit that degrades quality is caught as a regression before it ships (OpenAI Cookbook, detecting prompt regressions). The eval suite becomes a merge-blocking gate, the same role your unit tests play for deterministic code.

If your evals do not run on every change, they do not exist. Evaluation belongs in CI, not in a notebook someone opens quarterly.

Make the gate cheap enough to run on every change

"Run it on every change" sounds expensive, and naively it is. An LLM-as-a-judge call runs roughly 5 to 50 cents per case, so a 300-case suite of judged evals on every commit gets costly fast and slow enough that people start skipping it. A gate everyone bypasses is not a gate. The fix is to stop treating every eval as equally expensive.

I run the suite as a pyramid, the same shape as the test pyramid you already trust.

Every commit: deterministic checks. Schema validation, regex and string assertions, "did it cite a real source ID," refusal detection. These cost nothing, run in milliseconds, and catch the dumb regressions immediately.
Every PR: a classifier-backed sweep. Cheap learned scorers on the full regression set run at a fraction of a cent each, so you get an aggregate quality delta on every merge without paying judge prices.
Nightly or pre-release: the LLM judge on a sample. Reserve the expensive judged eval for the high-stakes rubrics on a sampled subset, and run it on a schedule, not on every keystroke.

This is the difference between a gate that ships and a gate that becomes theater. Cheap, fast, and statistically significant is a pick-two on any single tier, so you spread the three goals across the tiers instead of demanding all three on every commit. The deterministic tier buys speed, the classifier tier buys breadth, the judge tier buys depth, and the deploy is gated on the combination.

The eval set is the source of truth, so write it like spec

Once evals gate every deploy, the eval set quietly becomes the real specification. The prompt is just the current implementation that happens to pass it. Swap the model, rewrite the prompt, change vendors: the eval set is what carries the definition of correct across all of it. This is the same move I argue for in treating the spec as the source, applied to probabilistic systems. The evals are the executable spec.

That reframing has teeth. It means the eval set deserves more review than the prompt, because the prompt is disposable and the eval set is the asset. It means a vague rubric is a vague spec, and the model will optimize toward whatever the rubric actually rewards, including the parts you wrote sloppily. Evals are engineered, not generated. Every metric maps to a failure mode, every threshold has a reason.

It also changes what "the model is optimized against" means. When you tune a prompt to lift the eval score, you are fitting to the eval set. If that set is a faithful sample of production, you are improving the product. If it is a handful of cases someone imagined, you are overfitting to fiction and the production gap will find you. The quality of your evals is the ceiling on the quality of your system. I build the full version of this argument in Eval-Driven Development, and the harness it sits on in how to build an LLM evaluation framework.

Where eval-driven development is genuinely hard

I will not pretend this is free. Eval-driven development is harder than TDD in ways that are structural, not just unfamiliar, and anyone selling it as painless has not run it under deadline.

The first cost is statistics. One run is not enough, and neither is a small set. At 100 cases and 80% accuracy, a 95% confidence interval is roughly plus or minus 8 points, so two prompts that score 79 and 83 are statistically indistinguishable. Detecting a real 2-point improvement with confidence can take thousands of cases per arm, not a handful. Outputs also vary run to run, so a green eval today can be red tomorrow with no code change, which feels exactly like a flaky test and is not a bug you can fix. You handle it by sampling: run enough cases, aggregate, and report a confidence interval, not a single number. Pinning temperature to zero helps reproducibility but narrows what you measure. Doing this right means treating eval results as estimates with error bars, which most engineers were never trained to do (Cameron Wolfe, applying statistics to LLM evals).

The second cost is the judge. To score open-ended output at volume you usually need an LLM grading the LLM, and that grader has its own measurable failure modes. Position bias hands the first option in a pairwise comparison a 10 to 15 point edge. Verbosity bias rewards the longer answer at matched quality. Self-preference inflates a model's score on its own family's output. The judge is itself a probabilistic system that needs evaluating, so eval-driven development can recurse on you. The defensible mitigation is to calibrate the judge against blinded human labels and, for launch decisions, run an ensemble of judges from different model families so family-specific biases partly cancel. I cover when to trust it in my full guide to LLM evaluation; the short version is only trust the judge where it agrees with people.

The third cost is the ledger. The eval set decays. The world shifts, the model provider updates the underlying weights, and your frozen ruler slowly stops measuring the present. A frozen set is honest but goes stale; refreshing it costs real labeling time and risks resetting your baseline. There is no version of this where you write the evals once and walk away. That is the trade-off, stated plainly: eval-driven development trades a one-time test-writing cost for an ongoing measurement discipline you fund forever.

Why this is a revenue decision, not a hygiene one

The gate is cheaper than the churn. In a deterministic product a bad deploy is one bug and one rollback. In an LLM product a bad deploy is a quiet quality regression that erodes the trust the product is sold on, and you often cannot see it until renewal conversations get strange. The frozen eval set is the only thing that lets you say, with evidence, that this week's model is not worse than last quarter's on the exact inputs your customers send.

That sentence is worth money. It is the difference between shipping model upgrades with confidence and freezing on an old version because nobody can prove the new one is safe. Eval-driven development is what makes a probabilistic feature governable enough to keep selling. When generation is cheap, the durable advantage is being able to tell good output from bad at scale, and the eval suite is how you operationalize that judgment. The same harness powers A Field Guide to Evals and the production gating in my essay on evals that predict production.

FAQ

What is eval-driven development?

Eval-driven development is a methodology where you write an evaluation before changing the prompt or model, gate every deploy on a frozen eval set, and treat that set as the source of truth the system is optimized against. It is TDD adapted for probabilistic systems: instead of asserting one correct output, you score many outputs against a threshold you set before the run.

How is eval-driven development different from test-driven development?

TDD asserts a single deterministic output and one passing run is enough. Eval-driven development measures a distribution, because an LLM gives different outputs across runs and prompts. You score a sample of cases against a rubric, compare the aggregate to a pre-set threshold, and account for variance with sample sizes and confidence intervals rather than a binary pass or fail.

When should I use evals as tests instead of unit tests?

Use unit tests for the deterministic code around the model: parsing, routing, formatting, tool plumbing. Use evals for anything where the model's output is the thing under test and there is no single correct string. Most real LLM features need both, with the evals running in CI as a merge-blocking gate alongside the unit tests.

Does eval-driven development slow teams down?

It adds an upfront cost: writing the eval, freezing a golden set, and funding ongoing measurement as the set drifts. It removes a larger cost: shipping a regression you cannot see until a customer does. For a probabilistic feature you intend to keep changing, the gate is cheaper than the silent quality decay it prevents.

Isn't running evals on every commit too expensive?

Only if you run the expensive evals on every commit. Run the suite as a pyramid: free deterministic checks on every commit, cheap classifier-backed scorers on the regression set per PR, and the costly LLM judge on a sampled subset nightly or before release. That keeps the per-commit cost near zero while still gating the deploy on a real quality signal.

If you are adopting eval-driven development, start small: write one eval for your highest-risk feature, freeze a golden set from real traffic, and wire it into CI as a gate before you expand. When you want that discipline built into a team that ships probabilistic features for real, that is the work my Devlyn pods do, with AI engineers who treat the eval suite as the spec. Write the eval first, let the test suite lead, and trust the number over the demo.

Selling AI to people who have been burned by AI

Alpesh Nakrani — Sun, 17 May 2026 18:30:00 GMT

The market has a memory. After three years of vendor promises that collapsed on contact with reality, automations that broke under edge cases, "AI copilots" that required a junior engineer babysitting them full-time, pilots that lit up in demos and died in production, buyers have developed something that most sales teams treat as an obstacle. I think it is one of the most valuable raw materials in enterprise GTM right now: genuine, hard-won skepticism.

I run revenue at Devlyn, an AI-native engineering company. We sell technical delivery: senior engineers who work with AI tooling the way a craftsman works with a better set of tools, faster, yes, but with the same judgment and accountability. When I stepped into the CRO seat, the dominant wisdom in our space was to lead with AI's transformative potential. Speak to the board-level pressure. Promise acceleration. Show a demo. Close.

I tried a version of that posture for about six weeks. It produced a lot of second meetings and almost no signed contracts. When I actually listened to what prospects were saying, not the polite version they gave our AEs, but the unfiltered version that came out in the third conversation when trust had built up, the pattern was unmistakable. They had been burned. Not once. Repeatedly. And they had gotten good at sniffing out the same story being repackaged with a newer model name in the headline.

The instinct for most revenue leaders facing skepticism is to sharpen the pitch. Better slides. Stronger case studies. A more compelling ROI calculator. I went the other direction. I started leading with what we would not claim. That pivot changed everything.

Why demos lie by omission

The standard enterprise software demo is a controlled environment, a curated data set, a pre-seeded workflow, a deliberate avoidance of the edge cases that will make up forty percent of your buyer's actual use. AI demos amplify this problem by an order of magnitude. The model performs beautifully on the representative input. It hallucinates, drifts, or refuses on the inputs that reflect the buyer's real messy world. The buyer in the room doesn't see the messy world. They see the clean one. They sign. Three months later, they are that company, the one that burned the next buyer.

This is not malice in most cases. It is incentive structure. The person building the demo is optimizing for the meeting. The person who will own the implementation is not in the room. The gap between those two realities is where trust dies.

When I redesigned our sales motion, I insisted that every technical discovery call include the engineers who would actually work the account, not a solutions engineer performing delivery, but the people who would be in the code. This costs more in labor per deal. It closes at a significantly higher rate, compresses the post-close surprise curve, and produces clients who refer us without being asked. The math works.

The burned buyer's actual objections

Burned buyers rarely tell you directly that they don't trust you. They tell you adjacent things. "We're moving slowly on this." "We need more stakeholder alignment." "Can you send over references from clients in our exact vertical?" These are not bureaucratic speed bumps. They are diagnostic signals. The buyer is asking: Are you going to be the fourth vendor who overpromised?

When I trained our team to hear skepticism as a question rather than resistance, the entire conversation changed. Instead of overcoming objections, we started answering the real question underneath them. The real question is almost always some version of: "How do I know this will actually work in our environment, with our data, under our constraints, for our team that doesn't have the bandwidth to manage a vendor relationship that breaks?"

That question deserves a direct answer, not a case study. The case study is about someone else's environment. The direct answer is about yours. So we started saying: "We don't know yet. Here's the scoped pilot that tells us both."

The burned buyer's skepticism is not an obstacle to your deal. It is the most honest intelligence your sales process will ever receive about what the market actually needs.

A scoped pilot with defined success criteria is not a concession to a difficult buyer. It is the correct product. It forces specificity on both sides. It surfaces integration complexity early, when it is cheap to address. It builds the reference relationship that every subsequent deal in that vertical depends on. The pilot is not a trial period for the buyer to evaluate you. It is the period in which you both learn whether the engagement makes sense.

Lead with what you will not claim

Our go-to-market copy, our AE scripts, and our executive conversations now all open with constraints rather than promises. Verbatim: We won't oversell AI. We won't promise fantasy timelines. We won't trade quality for speed.

The first time I put that language in front of a skeptical CTO, he stopped me mid-sentence and asked me to repeat it. Not because it was unusual language in a vendor conversation, it was unusual because it sounded like something he would say to us rather than something we would say to him. That moment of alignment, when the buyer hears their own internal voice coming back at them from the vendor, is the beginning of a real commercial relationship.

This posture creates a specific kind of disqualification, and that disqualification is a feature. Buyers who need a vendor to validate a fantasy timeline, because they've already sold the board on an unrealistic schedule and need someone to co-sign the fiction, will not choose us. That is correct. We cannot deliver on a fantasy timeline without doing what every other vendor has done: overpromise, underdeliver, and add to the market's skepticism that then makes the next sale harder for everyone. The deals we turn away by being honest about scope protect the deals we keep by being honest about scope.

This is laid out in more depth in Selling AI to the Burned: Points of View, Volume I, which collects a year's worth of practitioner observations from revenue leaders navigating this exact posture shift. The through-line across all of them: constraint is a credibility signal when the market has been oversold.

Senior engineers only, and why that sentence closes deals

One of the cleanest differentiators in our pitch is also the simplest: Senior engineers only. No juniors hidden behind AI. We say this plainly, early, and we explain what it means in practice.

The AI hype cycle produced a specific pattern of vendor behavior: hire large numbers of junior engineers, buy them access to code generation tools, and present the output as senior-caliber AI-assisted work. The speed metrics can be impressive. The quality metrics, over time, are not. Burned buyers know this. They can often describe, specifically, the version of this pattern that failed them, the code that shipped fast and then required six weeks of remediation, the architecture decisions that got made at speed and then became the technical debt that blocked the next year's roadmap.

When we say senior engineers only, we are not just describing a hiring standard. We are making a claim about judgment. AI tooling amplifies whatever judgment it is attached to. A senior engineer using AI becomes faster without becoming less careful. A junior engineer using AI can produce volume that masks the absence of the underlying judgment. The client pays for speed in the short run and quality problems in the medium run.

The depth behind this distinction, what separates the CRO's read of the market from the CTO's, is something I have spent a lot of time thinking through in Revenue, Re-Engineered: What a CRO sees that a CTO can't. The short version: the CTO optimizes for technical excellence. The CRO has to optimize for the client's ability to trust, absorb, and advocate for the work. Those are not the same optimization target. Senior engineers help align them.

Ownership over hours. Outcomes over velocity.

The dominant pricing and value narrative in technical services for the past decade has been hours. Utilization. Time-and-materials. Even retainer structures are often mental-modeled as "how many hours am I buying." AI should be disrupting this entirely, and largely hasn't yet, because vendors found it easier to charge the same way and just produce more hours of output per engineer.

We price on outcomes. That is not a novel concept, but it is a genuinely rare practice because it requires the vendor to be willing to absorb scope risk, and it requires the client to be willing to define, specifically, what success looks like. Both sides find this uncomfortable. The vendor has been trained to protect margin through scope limitation. The client has been trained to protect scope through contract language. The result is contracts that are long on process and short on accountability for results.

The posture we take: Ownership over hours. Outcomes over velocity. We want to own something, a product function, a system, a transformation, and we want to be measured on whether it works. This is only possible when the senior engineer relationship runs deep enough that we actually understand the client's definition of working. Surface-level engagements cannot do this. They do not have the context.

AI does not make the question of what success looks like easier. It makes it more urgent. Speed without definition is just faster failure.

The burned buyer, specifically, has often been burned on velocity. They were sold speed and got chaos. When we reframe value around ownership and outcomes, we are directly addressing the wound. We are saying: the thing that failed you was not that the previous vendor moved too slowly. It was that they moved fast in the wrong direction and nobody took ownership of the direction being right.

Building a repeatable playbook for the skeptical market

Here is what the CRO playbook actually looks like when you operationalize this posture, step by step, in a real sales process.

First conversation: No demo. Discovery only. The goal is to understand where the buyer has been burned and what their internal definition of "this actually worked" looks like. Most vendors skip this because demos are compelling and discovery is vulnerable, it requires admitting you don't know the answer yet. Skip the demo. Do the discovery. You will close more and lose less post-close.

Second conversation: Bring the technical lead. Not a pre-sales engineer, the actual engineer who would own the work. Let the buyer's technical team run the conversation. What comes out of that conversation is more valuable than any slide deck: you learn the real constraints, the political landmines, the prior failures that are still in the room emotionally. Your pitch in the third conversation is entirely shaped by what you learn here.

Pilot design: Define success in writing before work begins. Not vague success ("the system performs well") but specific and measurable ("the extraction accuracy exceeds ninety-two percent on the validation set we have both agreed on, within eight weeks"). Commit to it publicly. If you cannot hit it, say so before the pilot starts, not after. The burned buyer has heard "we'll get there" too many times. Written, specific, pre-agreed success criteria are the closest thing to trust in an early-stage vendor relationship.

Evals over promises: Where the work is AI-specific, build evaluations into the engagement from day one. Not as a QA layer at the end, as a continuous signal that the model is performing, that the outputs are in range, that the thing you promised is the thing being delivered. Evals are the AI-native equivalent of tests. They are also the antidote to the demo problem: the demo shows you the clean world, the eval shows you the messy one. Share them with the client. Transparency here is a competitive advantage because most vendors do not offer it.

Reference architecture: After a successful pilot, codify what worked into a reference architecture the client team can maintain and extend without you. This sounds like it reduces future revenue. It actually expands it, because the client trusts you with larger, more complex work when they know you are not creating dependency by hoarding knowledge. The burned buyer has often been burned by this too, by vendors who built complexity they could not explain, creating lock-in through obscurity rather than value.

The hype cycle analysis in First Principles for a Hype Cycle makes this case from a structural angle: markets that overshoot on claims always mean-revert to proof, and the vendors who are already operating at proof when the mean-reversion happens capture disproportionate share. We are in that mean-reversion now. The buyers who were burned are now the market. Selling to them honestly is not a moral posture, though it is that too. It is a correct reading of where the market is and where it is going.

The long-term structural advantage

There is a compounding dynamic that takes twelve to eighteen months to show up clearly but is decisive when it does. Honest vendors, operating on proof rather than promise, build a reference base of clients who were not oversold. Those clients become the most powerful sales asset in the market: they can describe, specifically, what worked and why, without qualification or asterisk. They can take a skeptical peer's call and say not "it was transformative" but "here is exactly what we measured before and after, here is the pilot structure we used, here is what broke and how they fixed it."

That kind of reference is not available to vendors who oversold and then heroically salvaged something. The heroic salvage story contains an acknowledgment of the original failure. The skeptical buyer hears the failure more than the recovery. The clean reference, from a buyer who was accurately sold, accurately delivered to, and never had to absorb a crisis, is the cleanest asset in enterprise sales. It is built only one way: by not overselling in the first place.

I have watched three years of the AI hype cycle from multiple seats, engineering leadership, operations, and now revenue. The pattern is consistent: companies that sold the fantasy got the early deals and are now managing a reputation problem. Companies that sold the proof got fewer early deals and are now managing a referral engine. The market is sorting. If you are reading this as a revenue leader trying to figure out whether to sharpen the pitch or to constrain it, the answer, in this market, in this moment, is to constrain it. The buyer who has been burned has a very sensitive instrument for detecting when they are about to be burned again. That instrument is not your enemy. It is the most honest feedback loop your sales process has access to. Trust it. Sell to it. Build a business that deserves to survive the hype cycle.

The burned buyer is not a hard buyer. The burned buyer is a buyer who knows what they actually need. Meet them there.

How to Build a Golden Eval Set From Production

Alpesh Nakrani — Sat, 16 May 2026 18:30:00 GMT

A golden dataset for LLM evaluation is a frozen, versioned sample of real production traffic, paired with trusted reference answers, stratified by intent, and deliberately over-weighted toward the adversarial tail. You sample it from production, freeze it, version it like code, and never let it grow organically. That last rule is the one teams break, and it is why their eval numbers stop meaning anything.

I have watched a team build an eval set the wrong way and not notice for a quarter. A developer wrote a few cases while building the feature. A PM added some during review. The set grew with every sprint. By the time it had 600 cases, it measured what the team imagined users would do, not what users actually did. The two distributions had drifted far apart, and the green eval badge was lying.

This is the build process I trust for a golden eval set, step by step. It is the companion to my essay on evals that predict production, which makes the case for why sampling is the whole game, and it sits under my full guide to LLM evaluation. Here I focus on the mechanics: how to build an eval dataset that holds up as a fixed ruler.

Key takeaways

A golden dataset for LLM evaluation is a frozen, versioned slice of real production traffic with trusted reference answers, over-weighted toward the adversarial tail.
Sample from production, not from your imagination. Synthetic cases encode the team's assumptions, which is exactly what production breaks.
Stratify by intent, then size the set to 200 to 500 cases, and deliberately over-weight the hard buckets to roughly 20 to 25 percent.
Freeze the set and version it like code. Once frozen, your score can only go down, so a real regression shows up instead of hiding.
Refresh on a fixed cadence with a parallel overlap window, so you can tell genuine drift from a model regression.

A golden eval set is a fixed ruler, not a rubber band. Once you let it grow to chase a number, it measures nothing.

Sample your golden eval set from production, not your imagination

The first decision determines everything downstream: where the cases come from. A golden dataset for LLM evaluation has to be sampled from real production traffic. Synthetic cases written by the team encode the team's assumptions, and those assumptions are exactly what production breaks.

You cannot sample production before you have production, so the first version is honest synthetic. Treat it as a v0, not as ground truth. The moment you have request logs, harvest a stratified sample from them and replace the synthetic set (TrueFoundry, enterprise LLM benchmarking). The synthetic v0 gets you to launch. Real traffic keeps you there.

Sample across a window wide enough to catch real variation. Two weeks is a defensible minimum for most applications, because it captures the weekly cycle and a full deployment cycle of whatever calls your model (TrueFoundry). Monday traffic does not look like Saturday traffic. Sample one day and your set inherits one day's bias.

Stratify by intent before you size the set

A random sample of production traffic over-represents your easy, high-volume path and under-represents the cases that actually break. To build an eval dataset that predicts failure, stratify first, then sample within each stratum.

Start by bucketing real requests by intent: the distinct jobs users hire your feature to do. For a support assistant that might be returns, billing, account access, and product questions. Then size each bucket so the rare-but-costly intents are not drowned out. A practical target is 200 to 500 cases that cover the feature's full operational envelope (Maxim, golden dataset guide). Smaller than 200 and the per-stratum signal is noise.

Now over-weight the adversarial tail on purpose. The cases that break production are not spread evenly; they cluster. I deliberately oversample four buckets in the golden eval set:

Bottom-quartile model confidence on the original production response.
Cases a human reviewer corrected after the model answered.
Syntactically adversarial input: injection attempts, malformed requests, off-topic prompts.
Anything that previously caused an incident, no matter how resolved it feels.

That fourth bucket is the one teams skip. An incident feels closed once the hotfix ships. It is not closed until a future model candidate passes those exact cases on a held-out set. Otherwise you are one regression away from re-living it.

The split between the easy path and the tail is a design choice, so make it explicit. I aim for roughly 20 to 25 percent of the set in the adversarial buckets, with the remainder distributed across the real intent mix. That ratio is high enough to stress the model and low enough that the set still resembles production. There is no universal number here; the right ratio is the one that surfaces the failures your customers care about. Write the ratio down so the next person who refreshes the set does not quietly dilute it.

Random sampling and a curated golden set are not rivals; they answer different questions. Random sampling tells you how the model does on the average request. A curated, tail-weighted golden set tells you whether the model is safe to ship (random sampling vs golden dataset for regression tests). For a deploy gate, you want the curated set. For online monitoring, you sample live traffic. Both, in their place.

Attach trusted reference answers

An input without a trusted reference answer is not an eval case; it is a prompt. The reference, sometimes called ground truth or the target response, is the answer you have decided is correct for that input. Building these is the slow, expensive, unavoidable part of a golden eval set.

Have domain experts label the references, not whoever is free. The labeling step forces you to articulate precise quality criteria, which is most of the value (EvidentlyAI, LLM evaluation guide). If two experts disagree on the reference for a case, you have found an ambiguous spec, not a hard case. Resolve the spec before you label.

You can speed this up with LLM-assisted drafting, then human review, but the human owns the final reference. Measure inter-annotator agreement and treat low agreement as a signal that your rubric is underspecified (EvidentlyAI). This is the thesis in miniature: the machine drafts the work, the human evaluates and decides. For when an exact reference is impossible, score against a rubric instead, which I cover in my full guide to LLM evaluation.

Store the reference as data, not as a brittle exact-match string. Many production answers have several correct phrasings, so a strict string match will fail a good response and pass a lucky one. Capture the reference plus the grading rule alongside it: exact match for structured outputs, semantic similarity or a judge rubric for open-ended ones. The reference is the answer; the grading rule is how you decide the candidate met it. Conflating the two is how an eval set starts punishing correct answers and nobody trusts the number anymore.

Freeze the set and version it like code

Once the cases and references exist, freeze the whole thing as a named, immutable artifact. A frozen golden dataset is the only thing that lets you answer one question: is this model candidate better or worse than what we shipped six months ago, on exactly the same questions. Most teams cannot say that sentence with confidence. The frozen set is what earns it.

Version the set the way you version code. Each freeze gets a name, a date, and a commit. Here is the discipline I use, as a log against a frozen set.

# freeze the golden eval set as a versioned artifact

eval freeze \

--source prod-logs-2026-w23-to-w24 \

--strata intent,confidence,incident \

--size 412 \

--out golden-set-2026-w24-v2.jsonl

# a single frozen case, schema sketch

{

"id": "case-0417",

"intent": "billing",

"stratum": "incident-replay",

"input": "why was i charged twice in may",

"reference": "expert-labeled target answer",

"frozen_at": "2026-06-15"

}

# freezing means your score can only go down

set version golden-set-2026-w24-v2

cases 412 # immutable until next freeze

adversarial tail 22% # oversampled on purpose

Freezing has a consequence worth stating plainly: your score on an older frozen set can only go down. There is no quietly adding easy cases to lift the number. That is the point. When a model regresses, the frozen set tells you the truth instead of absorbing the regression into a moving baseline.

Set a refresh cadence so the ruler stays honest

A frozen set decays as the world changes. New intents appear, language shifts, the product adds features. So you refresh on a schedule, not on a whim, and you version each refresh. The set is frozen; the cadence is what keeps it current.

Cut a new version on a fixed interval, monthly or quarterly depending on traffic volatility, and feed real production failures back into each refresh. When a new failure mode shows up in production, it earns a place in the next freeze. Run the old version and the new version in parallel for an overlap period so you can compare scores across the boundary. Without the overlap, a refresh looks like a regression and you cannot tell drift from real change.

This is where the eval set connects to revenue. In an LLM product, a bad launch is rarely one incident. It is the slow erosion of the trust that makes the product sellable. A golden eval set that refreshes on cadence is the cheapest insurance against that erosion. The gate it enables costs less than the churn it prevents. If you want that gate run as managed infrastructure, that is the core of Devlyn's AI observability and monitoring work.

The honest trade-off

Over-weighting the adversarial tail makes your aggregate accuracy look worse than it would on a uniform sample. A golden eval set stacked with hard cases will report a lower number than a friendlier set, and someone will ask why your eval score dropped after you did it right.

That is the cost, and I pay it on purpose. A uniform sample flatters the model and hides the cases that generate support tickets and churn. An adversarial-weighted set tells you what breaks before a customer does. Ship the model that passes the hard set, not the one with the prettiest average. The aggregate number is for you; the tail is for the customer.

A uniform sample flatters the model. An adversarial set tells you what breaks before a customer finds it for you.

FAQ

What is a golden dataset for LLM evaluation?

A golden dataset for LLM evaluation is a frozen, versioned sample of real production traffic paired with trusted, expert-labeled reference answers. It is stratified by intent and over-weighted toward the adversarial tail so it measures what will break, not the easy path. You sample it from production, freeze it as an immutable artifact, and refresh it on a schedule rather than growing it case by case.

How big should an LLM test set be?

For most features, 200 to 500 cases that cover the full operational envelope is the practical range. Below 200 the per-stratum signal becomes noise, especially for rare intents. The number that matters is coverage of distinct intents and failure modes, not raw size. A focused 300-case set beats a sprawling 2,000-case set built from synthetic happy-path examples.

Why should I freeze the eval set instead of letting it grow?

Because a set that grows organically becomes a moving baseline, and a moving baseline cannot detect regression. When you freeze the set, your score on it can only go down, which means a real regression shows up as a real drop instead of being absorbed by new easy cases. Freezing turns the set into a fixed ruler. You add new cases through a versioned refresh, not by appending whenever you feel like it.

How often should I refresh a golden eval set?

Refresh on a fixed cadence, monthly or quarterly depending on how fast your traffic shifts, and cut a new version each time. Feed production failures and new intents into each refresh, and run the old and new versions in parallel for an overlap window so you can compare across the boundary. Refresh early when your online drift alerts fire.

If you are building your first golden eval set, start with two weeks of production logs, stratify by intent, and label 200 hard cases by hand. That alone catches most of what kills launches. When you are ready to make this the way your team works rather than a one-time chore, A Field Guide to Evals is the long-form version of this harness, and the sibling guides on building an evaluation framework and evaluating RAG show where the set plugs in. Sample from production, freeze it, version it, and trust the number it gives you over the one you wished for.

What a team is for after the machine does the work

Alpesh Nakrani — Fri, 15 May 2026 18:30:00 GMT

For most of my career, the org chart made intuitive sense. You needed people to produce things. You hired producers. You structured them into layers that could coordinate production at scale. Engineers wrote code. Designers made screens. Analysts pulled reports. Managers made sure the producers were producing. The shape of the organization followed the shape of the work, and the work was, fundamentally, about generation. Creating the artifact. Shipping the thing.

That assumption is cracking. Not dramatically, not all at once, but quietly, in the numbers that matter. When I look at what Devlyn's teams can actually output now versus two years ago, the ratio has shifted in ways that should force a serious conversation about what we are actually hiring people to do, and what the right shape of an organization looks like when generation is no longer the hard part.

The hard part, increasingly, is judgment.

The constraint moved. Org charts were built for a production bottleneck; when generation is cheap, the bottleneck shifts from "can we produce this?" to "is this the right thing?"
Roles split. Artifact-generation roles contract; judgment, specification, and decision-ownership roles expand. The senior-to-junior ratio tilts sharply senior.
The org gets flatter. The coordination middle thins, spans of control change, and the scarce skill becomes confident evaluation, not throughput you can count.

The org chart was built for a production constraint that no longer exists

Here is the old logic: you had more ideas than you had capacity to execute. The bottleneck was production. So you hired to the bottleneck. You hired engineers to write software because software had to be written by humans, one line at a time. You hired writers because copy had to be drafted by humans, one sentence at a time. You hired analysts because reports had to be assembled by humans, one query at a time. Your span of control, your team ratios, your hiring velocity, all of it was calibrated to the assumption that humans were the primary production unit.

The org chart you drew in 2019 was not wrong. It was correct for the constraint you were optimizing against. The problem is that the constraint has changed and most org charts have not.

When generation becomes cheap, when a capable model can produce a first draft, a working prototype, a data summary, a test suite, in seconds, the bottleneck moves. It moves from "can we produce this?" to "is this the right thing?" It moves from generation to evaluation. From throughput to direction. The machine can fill the canvas. The question is whether anyone in your organization actually knows what a good painting looks like, and whether they can specify it clearly enough that the machine makes the right one.

When generation becomes cheap, the bottleneck moves. It moves from "can we produce this?" to "is this the right thing?" From throughput to direction.

What contracts and what expands

Let me be concrete, because this conversation tends to get abstract in ways that obscure what is actually happening inside organizations.

What contracts: roles whose primary output is artifact generation. Junior engineers whose job was to implement clearly-specified tickets. Content producers whose job was volume. Analysts whose job was to run the same query in a new configuration. QA testers whose job was to manually execute test scripts. Not because these people are not valuable, but because the leverage available to a skilled senior person with good tooling now covers what previously required several people beneath them. The work gets done; fewer bodies touch it.

What expands: roles whose primary output is judgment, specification, and decision ownership. Senior engineers who can architect a system and then evaluate whether the model-generated implementation is actually sound. Product thinkers who can write a precise spec that constrains the output space. Editors who can tell the difference between a generated paragraph that passes and one that damages brand. Domain experts who can catch the confident wrong answer. People who own an outcome end-to-end, not a slice of the process.

At Devlyn, we have operationalized this as a hiring posture: Senior engineers only. No juniors hidden behind AI. That is not a statement about junior engineers being bad. It is a statement about what we actually need right now. We need people who can read model output and know immediately whether it is correct, not people who are still developing that calibration. The gap between a plausible-looking wrong answer and a correct one is invisible without deep expertise. Hiring people who cannot see that gap does not reduce the risk; it just buries it.

This is explored at length in Building an AI-Native Team: Hiring for judgment, not throughput, if you want the full framework for evaluating candidates in this environment. The short version is that the interview process needs to change entirely. You are not testing for production speed. You are testing for the ability to specify, evaluate, and own.

Spans of control have to change, and most managers are not ready for why

The traditional argument for span of control limits was coordination cost. A manager with twelve direct reports cannot give each of them the attention needed to keep work aligned and quality high. So you kept spans at six or seven, added layers, and scaled that way.

Two things happen when the production ratio shifts. First, a smaller team can produce more output, which means a manager coordinating a smaller headcount is now accountable for the output volume that previously required a much larger group. That changes the nature of the management job dramatically. You cannot manage an AI-augmented team the way you managed a headcount-equivalent team. The leverage is different. The failure modes are different. The thing you need to watch is different.

Second, the manager's job itself changes in character. The traditional manager spent a significant portion of time coordinating production, who is working on what, is it moving, what is the blocker. When generation is cheap and fast, that coordination function shrinks. The manager's actual job becomes: setting clear specifications, establishing quality evals, reviewing output against intent, and making the judgment calls that cannot be delegated. That is a different skill set than classic people management, and most managers promoted in the last decade were promoted for the old one.

The manager who thrives in this environment is not the one who is best at running standups and unblocking tickets. It is the one who can write a crisp spec, recognize when the output does not match it, and make a decision about what to do next without waiting for consensus. It is the one who can tell you, precisely, what good looks like, because if they cannot tell you, the model cannot be correctly steered, and the team will generate plausible work that is not quite the right work.

Judgment you can observe, not throughput you can count

Here is the hiring problem this creates: throughput is easy to measure. Lines of code, tickets closed, articles published, reports shipped. You can see throughput. You can count it. Judgment is much harder. You cannot run a job description that says "must demonstrate excellent judgment" and then test for it in a standard loop.

What I have learned, both at Devlyn and in conversations across the companies I advise, is that you have to engineer the interview process specifically to surface judgment, and you have to be willing to slow down and pay for it.

Some things that actually surface judgment: give candidates real work from your domain, with real ambiguity, and watch how they make sense of the constraint space before they produce anything. Ask them to evaluate output, not produce it. Show them a generated artifact and ask: what is wrong with this? What decision would you change? What would you need to know before you shipped this? People who have judgment can answer those questions. People who have been trained for throughput often cannot; they will pivot immediately to how they would produce something better rather than analyzing what is actually wrong with what is in front of them.

We have also learned to be explicit about ownership expectations. Our internal shorthand is: Ownership over hours. Outcomes over velocity. We are not measuring presence or pace. We are measuring whether the outcome was good and whether this person drove it. That shifts accountability in a way that selects for people who actually want to own things, which is a different population than people who are good at looking busy.

The broader framework here, how economies restructure when judgment becomes the scarce input, is something I think about through the lens of The Judgment Economy, which lays out where value concentrates when execution commoditizes. The shift happening inside companies is a micro version of the macro pattern: the humans who remain valuable are the ones who are doing the thing that is hardest to automate, which is not production. It is taste, intent, evaluation, and decision.

What the new org chart actually looks like

I want to resist drawing the definitive org chart because it varies by industry and company stage. But I can describe the shape. It is flatter. Fewer layers between the person setting intent and the output. Each person in the chain owns more surface area but has more leverage per hour of work. The "layer of people who receive clear specs from above and write clean code below" is thinner. The "people who write the specs" layer is thicker.

The ratio of senior to junior tilts sharply senior. Not because junior roles disappear entirely, there are still places where someone needs to develop craft, but because the leverage math changes. One senior engineer who can architect and evaluate is now worth three or four production-oriented juniors in terms of output quality you can trust. If you are trying to move fast and cannot afford to have a senior reviewing every line of junior output, you are better off with fewer people and higher floor-level judgment.

The evaluator role, someone whose job is specifically to review model output against quality and brand standards, becomes a real function rather than something bolted on informally. At Devlyn, we have found that the bottleneck on speed is rarely generation; it is confident evaluation. When you know the output is good enough to ship, you can move. When you are not sure, you loop. Building teams with strong evaluative capacity directly reduces that loop friction.

Cross-functional fluency matters more than it did. When a single person with AI tooling can produce what used to require a team, the question of whether that person understands the adjacent domain becomes critical. An engineer who cannot evaluate UX will generate technically correct implementations that miss the user. A designer who cannot evaluate technical feasibility will specify things that look right but cost three times as much to build. The traditional handoff model, produce here, hand off there, evaluate in another department, is too slow and too lossy when the pace of generation increases. You need people who can hold more of the stack in their heads.

The bottleneck on speed is rarely generation anymore. It is confident evaluation. When you know the output is good enough to ship, you can move.

The leadership implication nobody is saying out loud

I will say it: the people most at risk in this transition are mid-level managers who built their careers on coordination and throughput management, and senior leaders who mistake activity for output.

The coordination layer, the manager whose primary value was making sure the team was moving and blocking was removed, is thinner in an AI-augmented team because the production pace is faster and the work is more self-directing. You do not need as many people managing the pipeline when the pipeline runs faster. What you need is people who can set the intent clearly at the top and evaluate correctly at the bottom. The middle thins.

For leaders, the risk is a different kind. Leaders who have been rewarded for building headcount, for scaling teams, for managing complexity through organizational structure, they may resist the logic here because it runs counter to their intuitions about what scaling looks like. Scaling used to mean hiring. In an AI-augmented organization, scaling may mean keeping headcount flat while dramatically increasing the judgment density of the team you have. That is a different mental model of growth, and it requires leaders to stop treating headcount as the primary proxy for organizational capability.

The questions I bring to every leadership conversation now: What is the evaluation loop for AI-generated output in your organization? Who owns quality, and do they have enough seniority and domain knowledge to actually see problems? When you imagine the org chart two years from now, what assumptions about production cost are you baking in, and are those assumptions still correct?

For the detailed thinking I have been building out on this, including specific frameworks for evaluating team shape and hiring posture, I have covered this more fully in Org Charts After Automation: Points of View, Volume III. The patterns I am seeing across the companies I work with suggest we are still early, most teams have added AI tooling without rethinking the structure, which means they are getting productivity gains today but building technical debt into their org design that will be painful to unwind later.

The machine doing the work is not the disruption. The disruption is the implication for what you need the humans to do. Production is no longer the constraint. Judgment is. The org chart should reflect that, and most of them do not yet.

Frequently asked questions

How does org structure change after AI automation? The org chart flattens. Layers built to coordinate human production thin out, spans of control change, and the senior-to-junior ratio tilts sharply senior. Fewer people own more surface area, with more leverage per hour, and the work shifts from generating artifacts to specifying intent and evaluating output.

Should you stop hiring junior engineers when AI handles generation? Not universally, but the leverage math changes. The posture I run at Devlyn is senior engineers only: people who can read model output and know immediately whether it is correct. The gap between a plausible wrong answer and a right one is invisible without deep expertise, so hiring people who cannot see that gap buries risk rather than reducing it. If you are building a team along these lines, this is the work we do at Devlyn.

Who is most at risk in this transition? Mid-level managers whose value was coordination and throughput management, and senior leaders who mistake activity for output. When the pipeline runs faster and the work is more self-directing, the coordination middle thins. What survives is the ability to set intent clearly at the top and evaluate correctly at the bottom.

How to Reduce LLM Inference Cost Without Wrecking Quality

Alpesh Nakrani — Thu, 14 May 2026 18:30:00 GMT

You reduce LLM inference cost by pulling five levers in roughly this order: route most traffic to the smallest model that clears the bar, cache the parts of your prompt that repeat, quantize the models you self-host, cut the tokens you send and generate, and batch the work that can wait. Underneath all of it sits one decision that sets the ceiling on the rest: whether you adapt the model through RAG or fine-tuning. Get the order right and most teams find their bill was 2 to 5 times bigger than the workload actually required.

I have sat in the billing review where the inference line went up and to the right faster than revenue, and I have watched a team panic-shop for a cheaper model when the real problem was that every request carried a 6,000-token system prompt nobody had read in months. The cost was never the model. It was the design around the model. This piece is the hub for that design. Each lever below gets a short version here and a full deep dive of its own, because the levers compound and the order you pull them in matters more than any single one.

The cost is rarely the model. It is the design around the model, and most of that design was set on a Tuesday by someone who has since moved teams.

Key takeaways

If you read nothing else, these are the load-bearing claims:

Output tokens are the expensive ones. On most frontier APIs in 2026, output is priced 5x to 6x higher than input, so the cheapest token is the one you never generate.
Routing is the lever with the most upside for most teams. Sending the easy 80% of requests to a small model and escalating only the hard tail routinely cuts spend more than any model swap.
Prompt caching is nearly free money on stable prefixes. Anthropic prices a cache read at 0.1x the base input rate, a 90% discount on the cached portion (Claude pricing).
Quantization buys roughly 2x to 4x memory savings on self-hosted models, but the quality drop is task-specific, so you measure before you ship it.
The metric that should govern all of this is cost per resolved task, not cost per token. A cheaper call that fails more often is not cheaper once a human cleans up the mess.

What actually drives your LLM inference cost

Before you optimize anything, you have to know what you are paying for. Four drivers explain almost every inference bill, and three of the four are things you control with code, not contracts.

Model tier. The headline number. A frontier model costs an order of magnitude more than a capable mid-size one. As of mid-2026, Anthropic prices Claude Opus 4.5 at $5 per million input tokens and $25 per million output, Sonnet at $3 and $15, and Haiku 4.5 at $1 and $5 (Claude pricing). OpenAI lists GPT-5.5 at $5 input and $30 output per million (OpenAI pricing). The spread between the cheapest and most expensive model in a single vendor's lineup is 5x or more, before you do anything clever.

Output tokens. Notice that in every pricing line above, output costs 5x to 6x what input does. That asymmetry should change how you write prompts. A verbose model that "thinks out loud" for 800 tokens to answer a yes-or-no question is burning the most expensive resource you buy. Constraining output length is one of the few changes that cuts cost and latency at the same time.

Context length. Every token you stuff into the prompt is a token you pay for on every single call. That 6,000-token system prompt I mentioned is not a one-time cost; it is a tax on every request for the life of the feature. Long context is sometimes worth it. It is rarely worth it by accident.

Multi-step agents. This is the driver that surprises people. A single agent task can fan out into ten, twenty, fifty model calls, each carrying the growing conversation as context. The per-call price looks cheap. The per-task price is what lands on your invoice, and for agentic workloads it can be 50x the cost of a single completion. If you are building agents, read how to cut the tokens an agent burns before you scale the loop.

If your inference bill is already a board-level line item and you would rather have engineers who have shipped this before than learn it on production traffic, that is the work the Devlyn team does. Now to the levers, in the order I would pull them.

Lever 1: Right-size the model, then route

The single biggest lever, and the one most teams reach for last, is using a smaller model. Not because small models are trendy, but because most production requests are easy, and you are paying frontier prices to answer easy questions. The discipline is to find the smallest model that clears the bar for your task, make it the default, and escalate to a bigger model only when the small one genuinely cannot cope.

That escalation pattern is called a cascade, or model routing. You run every request through a cheap model first, check the output against a confidence or quality signal, and pass it up the chain only on failure. Most requests never escalate, so you pay top-tier prices only for the hard tail. The routing logic itself is cheap, and even crude rules capture most of the savings. I covered the full design space in the guide to LLM model routing, and the case for defaulting small in the CRO's case for shipping smaller models.

The two trade-offs to name honestly: routing adds a moving part that can break, and a badly tuned router that escalates too often gives you the cost of the big model plus the latency of the small one. You tune it against real traffic, not a hunch. If you want the longer treatment, In Defense of Small Models and Model Routing are the two book chapters I point teams to most often.

Lever 2: Cache what repeats, by prefix and by meaning

A surprising fraction of what you send a model is identical from call to call: the system prompt, the tool definitions, the few-shot examples, the policy document. You are paying to reprocess all of it every single time, and you do not have to.

Prompt caching stores the stable prefix of your request so later calls read it from cache instead of recomputing it. The pricing is hard to argue with: Anthropic charges a cache read at 0.1x the base input rate, which is a 90% discount on the cached tokens, with a small write premium the first time (Claude pricing). For any feature with a large fixed preamble, this is close to free money, and I walk through the breakeven math in the prompt caching guide.

Semantic caching goes one step further. Instead of matching identical prefixes, it matches requests that mean the same thing, so a question phrased two different ways can return one cached answer without a model call at all. The savings can be large on high-repetition workloads like support and search, but the risk is real too: a too-loose similarity threshold serves a stale or wrong answer to a question that only looked similar. The discipline lives in the semantic caching guide, and the broader question of what belongs in context at all is the subject of the Context Windows chapter.

Lever 3: Compress the model with quantization

If you self-host, quantization is the lever that lowers the floor under everything else. It stores the model's weights at lower numerical precision, dropping from 16-bit to 8-bit or 4-bit, which shrinks memory footprint by roughly 2x to 4x and lets the same model run on cheaper hardware, often faster.

The honest caveat is the one teams skip: quality degradation from quantization is task-specific. On many tasks the drop falls inside the noise of your eval suite and you would never notice. On tasks that lean on careful numerical reasoning, long-context faithfulness, or fine-grained instruction following, the drop can be real and material. The rule is simple and non-negotiable: run your evals at full precision and at the quantized precision, compare by failure mode, and only then decide. The savings are frequently real. So is the quality drop. The only way to know which dominates for your task is to measure, and I cover how in the LLM quantization guide.

Lever 4: Cut the tokens you send and generate

This is the unglamorous lever that pays back fastest, because it costs you nothing in infrastructure. Most prompts carry tokens that earn nothing. Industry write-ups in 2026 estimate that the average API call wastes a large share of its input tokens on context the model does not need, and you pay for every wasted token twice, once in the bill and once in the latency (Morph, LLM inference optimization).

The work here is mundane and effective: prune the system prompt to what the model actually uses, drop few-shot examples once a fine-tune or a better instruction makes them redundant, summarize conversation history instead of replaying it verbatim, and cap output length to what the task needs. Each change is small. Together they routinely take a meaningful bite out of the bill with zero quality cost, because you are removing tokens that were never doing any work. The full method, including how to find the dead weight in an agent loop, is in the token optimization guide.

You pay for every wasted token twice: once in the bill, and once in the latency your user sits through.

Lever 5: Batch and schedule the work that can wait

Not every request needs an answer in 800 milliseconds. Overnight enrichment, bulk classification, evaluation runs, content generation pipelines: these are throughput problems, not latency problems, and they should be priced like it.

On hosted APIs, the Batch endpoint is the lever. Anthropic and OpenAI both offer roughly a 50% discount on input and output for asynchronous batch processing, the trade being that you accept a longer turnaround in exchange for half-price tokens (Claude pricing). If a workload does not need to be synchronous, leaving it on the real-time endpoint is leaving half the bill on the table.

If you self-host, the equivalent lever is continuous batching at the serving layer. Instead of waiting for a whole batch to finish before starting the next, the server injects new requests into the compute stream as slots free up. Anyscale's benchmarks reported up to 23x throughput over naive batching with vLLM and PagedAttention, and consistently better latency across percentiles (Anyscale). The headline 23x is a best case; the everyday gain on mixed traffic is more like 2x to 4x, which is still the difference between one GPU and three.

RAG vs fine-tuning: the cost decision behind adaptation

Underneath the five levers is a more fundamental choice that sets the cost ceiling for the whole system: how you adapt a general model to your specific knowledge and task. The two paths, retrieval-augmented generation and fine-tuning, have very different cost shapes, and picking the wrong one taxes every request forever.

RAG keeps the model general and feeds it relevant context at query time. Cheap to set up and easy to update, but it inflates every prompt with retrieved chunks, so it raises your per-call input cost permanently. Fine-tuning bakes the knowledge or behavior into the weights. It costs more upfront and is slower to update, but it can shrink your prompts dramatically and let a smaller model do a bigger model's job, which compounds with Lever 1. The right answer depends on how often your knowledge changes and how repetitive your task is, and I work through the decision in RAG vs fine-tuning, with the longer versions in RAG That Survives Contact and To Fine-Tune or Not. If you want a team to build the retrieval layer rather than learn it the hard way, Devlyn does RAG and knowledge integration as a core practice.

A decision table you can pull from

Here is the same set of levers in one place: when to reach for each, roughly what it saves, and how much work it is to ship. Treat the savings figures as directional, not contractual, because the real number depends entirely on your traffic shape.

Lever	When to use it	Typical savings	Effort
Route to smaller model (cascade)	Most requests are easy; a hard tail needs the big model	Large; often the biggest single win	Medium
Prompt caching	Large fixed prefix reused across calls (system prompt, docs, tools)	Up to 90% on the cached portion (vendor-priced)	Low
Semantic caching	High-repetition queries (support, search, FAQ)	Large on repetitive traffic; watch the staleness risk	Medium
Quantization (self-host)	You run your own models and have GPU headroom to reclaim	~2x to 4x memory; meaningful cost cut if quality holds	Medium
Token / prompt trimming	Always; especially agent loops with growing context	Small per change, compounds; near-zero quality cost	Low
Batch API (async)	Work that tolerates delay (enrichment, evals, bulk jobs)	~50% on input and output (vendor-priced)	Low
Continuous batching (self-host)	You serve your own models at meaningful concurrency	2x to 4x throughput typical, up to ~23x best case	Medium
RAG vs fine-tuning choice	Adapting a general model to your knowledge or task	Sets the ceiling for every other lever	High

Three illustrative cost stories

These are composite scenarios, not specific clients, built from patterns I have seen repeat. The numbers are illustrative and rounded to make the mechanics legible, not pulled from a live system.

The support bot that forgot to cache. A team runs a support assistant on a frontier model, every request carrying a 5,000-token policy preamble. At a few hundred thousand calls a month, the preamble alone is most of the bill. They turn on prompt caching, the preamble becomes a cache read at 0.1x, and the input portion of the bill drops by most of its value overnight. No model change, no quality change, one config field. This is the lever I check first because it is the cheapest to try and the most often skipped.

The classifier on the wrong tier. An intake flow uses a top-tier model to classify incoming messages into one of eight categories, because that is the model someone wired in during the prototype and nobody revisited it. The task is easy; a small model handles it at near-identical accuracy after a light fine-tune. Moving classification to the small model and reserving the frontier model for the genuinely ambiguous cases cuts the per-task cost by an order of magnitude, and the latency improves as a bonus. The prototype default was quietly the most expensive decision in the system.

The agent that fanned out. An autonomous research agent makes, on average, thirty model calls per task, each replaying the full growing transcript as context. The per-call price looks trivial. The per-task price is anything but, and it scales with usage in the worst possible way. Summarizing the transcript between steps instead of replaying it verbatim, plus capping each step's output, takes a large bite out of per-task cost with no loss in the final answer. The fix was in the token budget, not the model.

The metric that should govern LLM inference cost: cost per resolved task

Every lever above optimizes cost. The thing you should actually be optimizing is cost per resolved task: total inference spend divided by the number of tasks the system fully handled without a human finishing the job. Cost per token is what the model charges you. Cost per resolved task is what the model costs you, and only one of those is on the P&L.

The trap is optimizing the wrong number. A cheaper model that resolves 70% of tickets is not cheaper than a pricier one that resolves 90%, once you price in the human cleaning up the other 30%. The token cost per call dropped; the cost per resolved task went up. I make the full case in the complete guide to LLM evaluation, and the tooling to measure it in the evaluation tools guide. Tracking this in production, rather than in a one-off spreadsheet, is squarely an AI observability and monitoring problem.

Cost per token is what the model charges you. Cost per resolved task is what the model costs you. Optimize the second one.

This is also where cost optimization stops being an engineering hobby and becomes a margin argument. When you can say "we resolve the same share of tasks for 40% less spend," you have turned an infrastructure project into a sentence a CFO will fund. That is the sentence most teams cannot say, because they measured tokens and never measured resolution.

Frequently asked questions

How do I reduce LLM inference cost the fastest? Start with the two cheapest levers to ship: turn on prompt caching for any large fixed prefix, and trim the dead tokens out of your prompts. Both are low-effort and low-risk. Then route the easy majority of requests to a smaller model, which is usually the single biggest win but takes more tuning. Batch anything that can tolerate a delay for the vendor's roughly 50% async discount.

Why are output tokens more expensive than input tokens? Generating tokens is sequential and compute-heavy: the model produces them one at a time, each conditioned on all the ones before. Processing input can be parallelized far more. That is why vendors price output 5x to 6x higher than input, and why capping output length is one of the few changes that cuts both cost and latency at once.

Does a smaller model always cost less to run? Per token, yes. Per resolved task, not always. If a small model fails often enough that a human has to finish the job, the cleanup cost can erase the token savings. That is why you measure cost per resolved task, not cost per token, and why a routed cascade beats a flat downgrade: it keeps the big model available for the hard cases that would otherwise fail.

Should I optimize cost before I have an eval suite? No. Every cost lever is a trade against quality, and without evals you cannot see the quality side of the trade. Quantize without evals and you ship a silent regression. Route without evals and you escalate the wrong cases. Build the eval harness first, then optimize against it so each change shows its true cost in both dollars and quality.

If you want the full operator's playbook on adapting and running models economically, the book chapters on small models, routing, and pricing intelligence go deeper than any single article can. And if you would rather have a team build the routing, caching, and observability into your stack from day one instead of retrofitting it after the bill scares someone, that is exactly what Devlyn's engineers do. Measure cost per resolved task. Pull the levers in order. Ignore the leaderboard.

RAG vs Fine-Tuning: When Each Wins in 2026

Alpesh Nakrani — Wed, 13 May 2026 18:30:00 GMT

RAG vs fine-tuning is the wrong fight. RAG handles knowledge that changes; fine-tuning shapes behavior that persists. Here is when each wins, and why most teams end up shipping both.

The short answer to RAG vs fine-tuning is this: use retrieval when the model needs facts that change, and fine-tune when you need behavior that stays fixed. RAG injects fresh, citable knowledge at query time without touching the weights. Fine-tuning bakes in tone, format, and decision patterns the model should apply every time. They solve different problems, and the moment you see that, most of the argument dissolves.

I have watched more than one team burn a quarter fine-tuning a model to fix a problem that retrieval would have solved in a week, for a tenth of the cost, and without the part where the fine-tune goes stale the next time the underlying data moves. The mistake is almost never "we picked the wrong tool." It is "we never named which problem we were actually solving." This piece is about naming that, settling the decision, and then telling you the honest answer most of these comparisons bury: in production, the serious systems run both.

This is one of the deep dives under my guide to reducing LLM inference cost without wrecking quality, because how you adapt the model sets the ceiling on everything downstream: what you cache, what you route, what you quantize. Get the adaptation decision wrong and you spend the rest of the year optimizing around a foundation you should not have poured.

Key takeaways

RAG is for knowledge that changes; fine-tuning is for behavior that persists. Get the seam right and the rest of the decision falls out of it.
RAG wins on freshness, provenance, and time-to-ship. It needs no labeled data and no training run, and every answer can cite its source.
Fine-tuning wins on behavior, latency, and token cost at scale. It shines when style, format, or a smaller-faster model is the goal, not when the facts keep moving.
LoRA and QLoRA changed the math. Parameter-efficient methods get you roughly 95% of full fine-tuning quality for about 10% of the cost, which turned fine-tuning from a six-figure project into a three-figure one.
The hybrid is the real 2026 answer. Retrieval for facts, fine-tuning for behavior. Most production systems that survive contact with real traffic end up running both.

RAG vs fine-tuning: the one distinction that settles the argument

Almost every bad version of this decision starts by comparing RAG and fine-tuning on a feature checklist, as if they were two brands of the same thing. They are not. The cleaner frame is to ask what kind of thing you are trying to change about the model, and there are only two: what it knows, and how it behaves.

Knowledge is the stuff that has a timestamp. Your product catalog, your support docs, last night's pricing, a customer's order history, a policy that legal rewrote on Tuesday. This information changes, and when it changes you need the model to use the new version immediately, not the version it saw during training. Retrieval is built for exactly this. You keep the knowledge in a store you control, fetch the relevant pieces at query time, and the model reasons over them on the spot.

Behavior is the stuff that does not have a timestamp. The voice the model answers in, the JSON shape it returns, the way it refuses out-of-scope requests, the reasoning pattern it follows on a domain-specific task. This is what fine-tuning is for. You are not teaching the model new facts so much as teaching it a new default way of acting, and you want that default to hold across every request regardless of what got retrieved.

Once you separate those two, the question "RAG or fine-tuning" usually answers itself per problem. If the thing you are unhappy with is the model citing a stale price, that is a knowledge problem and no amount of fine-tuning fixes it durably. If the thing you are unhappy with is the model writing like a generic chatbot when your brand voice is dry and specific, that is a behavior problem and stuffing more documents into the context will not fix it. Diagnose the problem as knowledge or behavior first, and the tool follows.

Fine-tuning a model to fix a freshness problem is like memorizing a phone book to avoid looking up a number. It works until the number changes, which it always does.

When RAG wins

RAG is the right call more often than the fine-tuning enthusiasts want to admit, and the reasons are mostly operational rather than academic. Reach for retrieval when one or more of these is true.

The knowledge is large, dynamic, or both. If the information the model needs updates faster than you would want to run a training job, retrieval is the only sane answer. You update a document in your store and the next query sees it. There is no retraining, no redeploy, no waiting for a fine-tune to bake. For anything with a freshness requirement measured in hours or days, this is decisive.

You need provenance. Because every RAG answer is built from retrieved chunks, you can show the customer exactly which source backed the claim. In regulated work, in anything touching compliance or audit, this is not a nice-to-have. A fine-tuned model produces fluent answers with no receipt attached, which is a liability the first time someone asks why the system said what it said. I have walked through what it takes to make retrieved answers trustworthy enough to cite in how to evaluate a RAG system.

You do not have labeled training data. Fine-tuning needs examples, often thousands of them, curated and clean. Most teams do not have that lying around, and building it is a real project. RAG needs your documents, which you already have. The time-to-first-useful-answer is days, not quarters.

If retrieval is where you are headed and you want it to survive past the demo, that is the exact problem Devlyn's RAG knowledge integration work is built around. The naive version is easy; the version that holds up under real queries, messy documents, and changing data is where teams get stuck. My book RAG That Survives Contact walks through the failure modes that show up around month three, which is when the prototype's cracks usually surface.

When fine-tuning wins

Fine-tuning earns its keep when the problem is behavioral or economic, not informational. Reach for it in these cases.

You need consistent behavior, tone, or output format. If the model has to answer in a specific voice, follow a house style, or return a rigid structure every single time, fine-tuning encodes that far more reliably than a prompt the size of a short novel. You can spend thousands of tokens per request describing the behavior you want, or you can train it in once and stop paying that tax on every call.

You are optimizing latency or token cost at scale. A fine-tuned smaller model can match a larger general model on a narrow task, which lets you replace an expensive frontier call with a cheap specialist one. At high volume that is the difference between healthy margin and a bill that grows faster than revenue. I make the broader version of this case in the CRO's case for shipping smaller models, and it routinely starts with a fine-tune.

You need deterministic handling of known edge cases. If there is a specific class of input the base model keeps getting wrong, and you can produce examples of the right behavior, fine-tuning teaches it the pattern in a way that retrieval cannot. Retrieval gives the model better facts; it does not change how the model reasons about them.

What changed in 2026 is that fine-tuning stopped being expensive. Parameter-efficient methods like LoRA and QLoRA train a tiny fraction of the model's weights, and the published cost analyses put it bluntly: LoRA reaches roughly 95% of full fine-tuning performance for about 10% of the cost, training 1 to 10% of parameters on a single GPU rather than a cluster (Stratagem Systems, 2026). In practical terms, a LoRA run can land in the $50 to $300 range where a full fine-tune of the same model would run $5,000 to $15,000. That collapse in cost is the single biggest reason fine-tuning re-entered the default toolkit.

RAG vs fine-tuning vs hybrid: the decision table

Here is the comparison in one place: RAG, fine-tuning, and the hybrid, across the four dimensions that actually drive the decision in production.

Dimension	RAG	Fine-tuning	Hybrid
Upfront cost	Low (no training run)	Low to moderate with LoRA; high for full fine-tune	Moderate (you pay for both)
Freshness	Excellent - update the store, see it instantly	Poor - frozen at training time	Excellent for facts, fixed for behavior
Control / provenance	High - every answer can cite its source	Low - fluent answers, no receipt	High on facts, encoded on behavior
Ongoing effort	Maintain the retrieval pipeline and data	Re-tune when the base model or domain moves	Maintain both, but each does its own job
Best for	Knowledge that changes; citations; fast ship	Behavior, tone, format; latency; cost at scale	Systems where freshness and behavior both matter

If you read that table and conclude "hybrid, obviously," hold on. The hybrid wins on quality whenever both freshness and behavior matter, but it costs you two systems to build and maintain. Plenty of products genuinely only have a knowledge problem, or only a behavior problem, and for those the single-tool answer is correct and cheaper. Do not pay for both because a table told you to.

LoRA, QLoRA, and RAFT: what changed the math

Three acronyms are doing most of the work in the 2026 version of this decision, and they are worth understanding because they reshape the trade-offs the older comparisons assume.

LoRA and QLoRA are parameter-efficient fine-tuning, or PEFT. Instead of updating all the model's weights, you train small adapter matrices and leave the base frozen. QLoRA adds quantization so the whole thing fits on commodity hardware. The effect on the decision is that the old "fine-tuning is a capital project" objection is mostly dead. When a domain-specific fine-tune costs a few hundred dollars and a weekend instead of a cluster and a quarter, the bar for choosing it drops a long way. If you want the deeper treatment of when to spend that effort at all, Fine-Tuning or Not is the framework I keep coming back to.

RAFT is the one that confuses the "RAG vs fine-tuning" framing the most, because it is both. Retrieval-Augmented Fine-Tuning, from UC Berkeley, trains the model on examples that include both the relevant retrieved documents and deliberate distractor documents, with chain-of-thought answers, so the model learns to use retrieval well rather than just having retrieval bolted on at inference. The reported gains are not marginal: RAFT improved HotpotQA accuracy by 35.25% over an instruction-tuned Llama-2 baseline, and by 30.87% over a domain-specific fine-tune alone (SuperAnnotate, summarizing the UC Berkeley RAFT paper). The takeaway is that retrieval and fine-tuning are not rivals at the bottom of the stack. The best results come from fine-tuning the model to be good at retrieval.

None of this works without measurement, by the way. Whether a fine-tune helped, whether retrieval is grounding the answer or the model is confabulating, whether the hybrid is worth its cost, are all questions you answer with an eval suite, not a vibe. I lay out the metrics that actually predict production behavior in my guide to LLM evaluation.

The hybrid most teams actually ship

Here is the part the clean comparisons skip. When you watch what serious teams run in production in 2026, the answer is rarely RAG or fine-tuning. It is both, split along the knowledge-versus-behavior seam. The reported share of production systems using both has been climbing toward the majority, and that matches what I see in practice.

The pattern is consistent: fine-tune the model for behavior, format, and the reasoning pattern your task needs, then layer retrieval on top for the facts. The fine-tune handles "answer like our brand, in this JSON shape, refusing these categories." The retrieval handles "and here is today's actual data to answer over." Neither tool is asked to do the other's job, which is exactly why it holds up.

A useful illustration, with numbers chosen to make the shape clear rather than to report a specific system: imagine a support assistant where the base model, prompted hard, resolves about 70% of tickets and writes in a tone the brand team keeps flagging. Fine-tuning on a few thousand cleaned past tickets fixes the tone and lifts clean resolutions to the low 80s, but the model still cites policies that changed last month. Add retrieval over the live policy store and the stale-answer complaints fall away, because the facts now come from a source that updates the moment legal does. Behavior came from the fine-tune; freshness came from RAG. That split is the whole game.

The contrarian version of this advice is worth saying plainly: if you are early and unsure, start with RAG alone. It ships faster, it is cheaper to be wrong with, and it tells you quickly whether your problem is actually about knowledge. Reach for fine-tuning once you have evidence that the residual problem is behavioral, not informational. Building both on day one, before you know which lever moves your metric, is how teams end up maintaining two systems to solve one problem.

The cost no one prices in: maintenance

Every cost comparison I have read prices the training run and the inference tokens. Almost none of them price the thing that actually hurts: a fine-tune is a liability you re-pay every time the world moves.

When you fine-tune, you create a frozen artifact tied to a specific base model and a specific snapshot of your domain. The base-model provider ships a better, cheaper version, and your fine-tune is stuck on the old one until you redo the work. Your domain shifts, your product changes, your policies update, and the behavior you trained in drifts out of date. Retrieval pipelines have maintenance too, but it is the boring kind, keeping the data fresh and the index healthy, and that work pays off the moment you do it. A fine-tune's maintenance is a periodic re-investment that produces nothing new, just parity with where you already were.

This is the revenue consequence operators miss. The fine-tune that looked cheap at $200 of compute is not a one-time $200. It is $200 plus the engineering time to rebuild it every time you want to ride a better base model, plus the opportunity cost of the upgrades you skip because re-tuning is a hassle. I have seen teams stay on a worse, more expensive base model for a year because migrating their fine-tune was nobody's priority. Price the artifact's whole life, not just its birth, and the hybrid's "retrieval for what changes" half starts looking like the cheaper kind of complexity.

A fine-tune is not a purchase. It is a subscription you pay in engineering time, and the bill comes due every time the base model or your domain moves.

If you are weighing this for a real system and want a team that has shipped the hybrid and lived with its maintenance, that is what the Devlyn engineering team works on. The hard part was never picking RAG or fine-tuning. It was building the version that still works in month six.

Frequently asked questions

Is RAG cheaper than fine-tuning?

Usually to start, yes. RAG has no training run, so the upfront cost is low and you can ship in days. Fine-tuning has dropped sharply with LoRA, often into the low hundreds of dollars, but it carries a maintenance cost RAG does not: you re-pay the work whenever the base model or your domain changes. At very high query volume, a fine-tuned smaller model can be cheaper per call than RAG's longer contexts, which is why high-scale systems often use both.

When should I fine-tune instead of using RAG?

Fine-tune when the problem is behavior, not knowledge. If you need a consistent tone, a strict output format, deterministic handling of known edge cases, or a smaller-faster model that matches a larger one on a narrow task, fine-tuning is the right lever. If the problem is that the model needs current or large-volume facts, that is a retrieval problem and fine-tuning will not fix it durably.

Can I use RAG and fine-tuning together?

Yes, and most serious production systems do. The standard pattern is to fine-tune the model for behavior, format, and reasoning, then add retrieval for the facts that change. RAFT takes this further by fine-tuning the model specifically to use retrieved context well, and the reported benchmark gains over either approach alone are substantial.

Does fine-tuning add new knowledge to a model?

Poorly, and not durably. Fine-tuning can nudge a model toward facts it saw in training data, but it is the wrong tool for knowledge that updates, because the knowledge is frozen at training time and goes stale. For anything with a freshness requirement, retrieval is the reliable way to give a model current information, and it has the added benefit that every answer can cite its source.

If you want the full framework for this decision, the maintenance math, the seam, the hybrid patterns, my books Fine-Tuning or Not and RAG That Survives Contact go deep on each side. And if you would rather have a team build the version that holds up in production, Devlyn's RAG knowledge integration work is built for exactly that. Pick the tool that matches the problem. Most of the time, the problem is both.

Prompt Caching: What It Is and When It Saves Money

Alpesh Nakrani — Tue, 12 May 2026 18:30:00 GMT

Prompt caching is a way to reuse the part of a prompt the model has already processed, so the repeated tokens get billed at a steep discount instead of full price on every call. It saves money when a large, stable chunk of your prompt repeats across many requests inside the cache window: a long system prompt, a set of documents, your few-shot examples, or the running history of a conversation. If your prompts share almost nothing from one call to the next, it saves you nothing and adds a write fee. The whole game is prefix stability.

I have sat in the inference billing review where the line item kept climbing and nobody could explain why, until we looked at the actual requests and found a 7,000-token system prompt riding along on every single call. We were paying full input price to re-send the same instructions thousands of times an hour. Prompt caching fixed most of that bill in an afternoon. This piece is the deep dive on that lever from my guide to reducing LLM inference cost, where caching is one of five levers worth pulling.

Prompt caching does not make tokens cheaper. It makes you stop paying to re-read the same tokens you already paid to read a second ago.

Key takeaways

If you read nothing else, these are the load-bearing claims:

Prompt caching reuses an exact, contiguous prefix. It is not meaning-based; change one token early in the prompt and the cache for everything after it is gone.
Put the stable content first and the variable content last. Caching only works on the prefix, so a single moving token near the top wipes the whole cache for that request.
On hosted APIs you pay a write fee, then read at roughly 10% of input price. Anthropic charges 1.25x input to write a 5-minute cache and 0.1x to read it, so it pays off after one or two hits.
It saves nothing on cold, one-off prompts. The payoff is hit rate, and hit rate comes from request patterns that repeat a long prefix inside the TTL.
Prompt caching reuses exact text; semantic caching reuses meaning. They solve different problems and you often want both.

What prompt caching is, and what it is not

Prompt caching stores the model's intermediate computation for a chunk of your prompt and reuses it on the next request that starts with the exact same tokens. On a hosted API like Anthropic or OpenAI, you do not see the internals; you mark a prefix as cacheable, and a later request that matches that prefix reads it back at a deep discount instead of being reprocessed. The match has to be exact and contiguous from the start of the prompt. This is why people also call it prefix caching.

The word "prefix" is the whole story. Caching works left to right from the top of your prompt and stops at the first token that differs. If your system prompt and documents are identical but you slipped a timestamp or a request ID into the second line, the cache breaks right there and everything after it is recomputed at full price. The cache is not fuzzy and it is not smart. It is a literal string match on the front of your input.

That exactness is what separates prompt caching from semantic caching, which I cover in its own article. Semantic caching asks "have I seen a question that means roughly the same thing?" and returns a stored answer. Prompt caching asks "have I processed this exact text already?" and skips the recompute. One reuses meaning, the other reuses tokens. They are not competitors; in a real system you often run both, and confusing them leads to caching the wrong layer.

One more boundary worth drawing early: prompt caching is about the input you send, not the output the model generates. It lowers the cost and latency of feeding context in. It does not store or reuse completions. If you want to avoid regenerating an answer you have produced before, that is a response cache or semantic cache, a different tool in the same drawer.

How prefix caching works under the hood

When a transformer reads your prompt, it computes a set of internal tensors called the KV cache, one entry per token, that the model needs in order to attend to everything that came before. Normally that work is thrown away after the response. Prefix caching keeps it. The next request that shares the same opening tokens reuses those tensors instead of recomputing them, which is why the savings show up as both lower cost and faster time-to-first-token.

On self-hosted stacks you can see exactly how this is done. vLLM implements automatic prefix caching with a hash-based block scheme: it splits the KV cache into fixed blocks and hashes each block from its own tokens plus the hash of the prefix before it, so a block is only reused when the entire chain of tokens leading up to it matches. It caches full blocks only, and evicts on a least-recently-used policy when it needs space (vLLM docs). SGLang's RadixAttention does the same job with a radix tree of cached prefixes plus LRU eviction and cache-aware scheduling (SGLang paper).

The mechanics matter because they explain the failure modes. Block-level hashing is why a one-token change near the top of a long prompt is so expensive: it changes the hash of the first block, which changes every dependent block after it, so nothing downstream can be reused. LRU eviction is why a cache entry you expected to be warm can be cold under load: a burst of other traffic pushed your blocks out of memory. None of this is mysterious once you know the cache is a chain of hashed blocks, not a single blob.

# the cache breaks at the first token that differs

# BAD: volatile token sits in the prefix

system: "You are a support agent. Request id: req-8a3f. ...4000 more tokens..."

# every request has a new id, so the prefix never matches -> 0% hit rate

# GOOD: stable content first, volatile content last

system: "You are a support agent. ...4000 stable tokens..."

user: "Request id: req-8a3f. {the actual question}"

# the 4000-token prefix is identical every call -> high hit rate

If you are doing this work and want a team that has built caching into production pipelines rather than bolted it on after the bill arrived, the Devlyn engineering team works on exactly this.

Vendor support and pricing in 2026

Every major provider now supports some form of prompt caching, but the pricing models and TTLs differ enough that you cannot reason about one from another. Here is the verified state as of mid-2026, sourced to each vendor's own documentation. Treat the dollar figures as snapshots; the multipliers are the durable part.

Provider	Cache read discount	TTL	Notes
Anthropic (Claude API)	Read at 0.1x input (90% off); write costs 1.25x (5-min) or 2x (1-hour)	5 minutes (default) or 1 hour	Explicit cache breakpoints; min 1,024 tokens (Sonnet/Opus 4.5+), 4,096 for Haiku 4.5 (Anthropic pricing)
OpenAI (API)	50% off cached input on GPT-4o-era; up to ~90% cost reduction on cache hits for GPT-5 series	~5-10 min idle, up to 1h; Extended caching up to 24h (default 24h on gpt-5.5)	Automatic, no code change; activates at 1,024 tokens, grows in 128-token steps (OpenAI)
Google (Gemini / Vertex)	~75% reduced input price on cached content	Default 1 hour, configurable	Implicit caching automatic; explicit caching min ~32,768 tokens, model-dependent
Self-hosted (vLLM, SGLang)	Cache hit skips recompute entirely; you pay only the GPU you own	Until LRU eviction under memory pressure	Automatic prefix caching on by default in recent vLLM; full-block, hash-based reuse

Two things in that table trip people up. First, Anthropic charges you to write the cache, while OpenAI's automatic caching does not bill a separate write fee. That changes the break-even math: with Anthropic you are betting the prefix gets reused enough to earn back the 1.25x or 2x write premium, whereas with OpenAI the cache is free upside when it hits. Second, the OpenAI discount is not one number. The original GPT-4o announcement was a flat 50% on cached input; the current GPT-5 series quotes up to roughly 90% cost reduction on cache hits. Quote the model you are actually running, not the headline.

When prompt caching actually pays off

The payoff is entirely about hit rate, and hit rate is a property of your traffic, not of the feature being on. Caching a prefix that never repeats costs you the write fee and returns nothing. Caching a 5,000-token system prompt that rides on ten thousand requests an hour is close to free money. Before you turn anything on, ask one question: what large chunk of my prompt is identical across many calls inside the TTL window?

On Anthropic the break-even is concrete and small. A 5-minute cache write costs 1.25x the input price; a read costs 0.1x. So the cache pays for itself after a single hit on the 5-minute TTL, or after two hits on the 1-hour TTL whose write costs 2x (Anthropic pricing). For Claude Sonnet 4.6, that is a 5-minute write at $3.75 per million tokens versus reads at $0.30 per million against the $3.00 base. If your prefix gets read even twice before it expires, you are ahead.

Here is the shape of it in practice, with illustrative numbers rather than any specific client's bill. Take a support assistant with a 4,000-token stable system prompt, running on Sonnet 4.6 at 200,000 calls a month.

# illustrative, not a specific system

prefix_tokens 4000

calls_per_month 200000

# no cache: pay full input on the prefix every call

no_cache 4000 * 200000 * $3.00/M = $2,400/mo

# with cache: ~1 write per 5-min window, rest are reads

writes ~negligible share at $3.75/M

reads 4000 * ~200000 * $0.30/M = ~$240/mo

prefix_savings ~$2,160/mo # ~90% on the cached portion

That is the cached portion only; you still pay full price on the variable tail and the output. But the prefix is usually where the bloat lives, which is why this lever moves the bill. Agent loops are the strongest case I see in production: a multi-step agent re-sends its entire tool definitions, instructions, and growing scratchpad on every step, and almost all of it is an identical prefix step to step. Long-document Q&A, multi-tenant SaaS with a shared system prompt, and repo-wide code assistants all share that shape. Reported hit rates on workloads like these commonly land in the 60 to 85 percent range, which is the difference between a feature you can afford and one you cannot.

The pitfalls: cache misses, TTL, and a prefix that will not sit still

The number one reason teams turn on prompt caching and see no savings is a prefix that will not sit still. A timestamp in the system prompt, a per-request ID injected near the top, a user name interpolated before the instructions, a tools list that reorders itself: any of these changes the prefix and zeroes your hit rate while you keep paying write fees. The fix is mechanical. Move every stable token to the front and every variable token to the back, and verify with the usage fields that reads are actually happening.

TTL is the second trap. A 5-minute window is generous for a busy endpoint and useless for a sleepy one. If your traffic arrives in bursts with long gaps, your cache is cold by the time the next request shows up, and you paid the write for nothing. This is where the longer TTLs earn their premium: Anthropic's 1-hour cache costs 2x to write but survives the quiet stretches, and OpenAI's extended retention now defaults to 24 hours on gpt-5.5. Match the TTL to your real inter-request gap, not to a default.

The third trap is silent eviction. On self-hosted vLLM or SGLang, your warm prefix can be pushed out of GPU memory by other traffic under the LRU policy, so a cache you measured as warm at noon is cold during the afternoon peak. On hosted APIs the equivalent is that nothing guarantees your entry is still there inside the TTL; it is best effort. Instrument the cache-hit fields in the response and watch hit rate as a live metric, because a cache you assume is working and is not is worse than no cache: you are paying write fees for misses.

A cache you assume is working and is not is worse than no cache. You are paying the write fee on every miss and seeing none of the read discount.

There is also an isolation dimension that matters in multi-tenant products. You do not want tenant A's cached prefix served to tenant B, and providers handle this with workspace-level or organization-level isolation that you should confirm rather than assume. On self-hosted stacks, the hash includes salts precisely so that two tenants with identical text do not collide. Watching hit rate per tenant is the kind of thing that belongs in AI observability and monitoring, not a one-off spreadsheet, because it drifts as your prompts and traffic change.

Prompt caching vs semantic caching

This is the distinction I see conflated most, and getting it wrong wastes both effort and money. Prompt caching is exact: it reuses the model's computation for an identical prefix and is invisible to your answer quality, because the model still runs, it just skips re-reading what it already read. Semantic caching is approximate: it matches a new query to a stored one by meaning and returns the stored answer, skipping the model entirely.

The trade-offs are opposite. Prompt caching has essentially no quality risk and a modest, reliable saving on repeated context. Semantic caching can save far more, because it skips inference altogether on a hit, but it carries a real risk of returning a stale or subtly wrong answer when two questions mean almost but not quite the same thing. One is a cost optimization with no downside to reason about; the other is a cost optimization you have to evaluate like a feature.

In a mature pipeline you use both, at different layers. Semantic caching sits in front and answers the genuinely repeated questions without touching the model. Prompt caching sits behind it and cuts the cost of the requests that do reach the model by reusing their stable context. Pair both with the other levers in the pillar, token optimization to shrink what you cache and model routing to send each request to the cheapest model that clears the bar, and the cost curve bends well before you ever shop for a cheaper model.

Frequently asked questions

What is prompt caching in simple terms? It is reusing the part of a prompt the model already processed so you do not pay full price to feed the same tokens again. You mark a stable prefix as cacheable, and later requests that begin with the exact same tokens read it back at a deep discount instead of being reprocessed from scratch.

When does prompt caching save money? When a large, stable chunk of your prompt repeats across many requests inside the cache window: a long system prompt, shared documents, few-shot examples, or conversation history. It saves nothing on cold, one-off prompts, and on hosted APIs that charge a write fee it can cost slightly more if the prefix is never reused.

What is the difference between prompt caching and prefix caching? They are the same idea. "Prefix caching" is the name used on self-hosted stacks like vLLM and SGLang, where the cache works on the literal prefix of your tokens. "Prompt caching" is the product name the hosted vendors use for the same exact-prefix reuse.

Why is my prompt cache not getting any hits? Almost always because a variable token sits inside the prefix and breaks the exact match: a timestamp, a request ID, a user name, or a reordered tools list near the top of the prompt. Move all stable content to the front and all variable content to the end, then confirm reads are happening in the response usage fields.

If you want the full map of where caching fits among the other cost levers, that is my guide to reducing LLM inference cost, and the economics of caching context against long windows is the subject of my book The Context Window Problem. If you would rather have a team instrument hit rate and build caching into your stack from day one instead of after the bill, that is what Devlyn's observability work is for. Cache the tokens that repeat. Pay full price only for the ones that change.

LLM Model Routing: Cheapest Model That Can Do the Job

Alpesh Nakrani — Mon, 11 May 2026 18:30:00 GMT

LLM model routing sends each request to the cheapest model that can handle it, escalating only when needed. Here is how it cuts cost without cutting quality.

LLM model routing is the practice of putting a decision layer in front of a pool of models so each request goes to the cheapest one that can answer it acceptably, and only the genuinely hard requests pay frontier prices. It cuts cost because most production traffic is easy, the price gap between a cheap model and a frontier model is roughly 100x per token in 2026, and you stop paying the top rate on the 80% of requests that never needed it.

I run revenue at Devlyn, and I have signed off on enough inference invoices to tell you where the money actually goes. It does not go to the hard problems. It goes to running a model that costs $25 per million output tokens on requests a model that costs $5 could have answered the same way. Routing is how you stop doing that. This piece is the routing chapter of the larger cost story; for the full picture, start with my guide to LLM inference cost.

Most of your inference bill is not the cost of solving hard problems. It is the cost of using a frontier model on easy ones.

Key takeaways

If you read nothing else, these are the load-bearing claims:

Model routing pays frontier prices only for the hard tail. A decision layer sends easy requests to cheap models, so you stop overpaying on the bulk of traffic that never needed the expensive model.
The savings are large and well documented. The RouteLLM work reported roughly 85% cost reduction on one benchmark while keeping 95% of GPT-4 quality, using the strong model on only 14% of queries.
A cascade and an upfront router are different bets. A cascade tries the cheap model first and escalates on failure; an upfront router classifies the request and picks the model before generating. Each fails differently.
Every router is an eval problem in disguise. You route on a quality or confidence signal, and if that signal is wrong, the product degrades quietly while the dashboard looks fine.
The cheapest rule that captures most of the savings beats the elegant router you cannot debug. Crude routing logic is usually fast, cheap, and good enough.

What LLM model routing actually is

Picture your application as something that talks to a menu of models rather than one model. The menu spans a frontier model, a mid-tier model, a cheap fast model, and maybe a small self-hosted model you control. Routing is the policy that picks one of them per request. That is the whole idea, and it is worth understanding before you reach for any product that sells it.

The reason routing works is that production traffic is lopsided. A support assistant gets a hundred "where is my order" questions for every one that needs real reasoning. A coding tool gets a flood of small completions and a handful of genuine architecture questions. If you send all of it to a frontier model, you pay the top rate on every request, including the ones a far cheaper model would have handled identically.

The price gap is what makes this matter. As of 2026, the spread between the cheapest capable models and the most expensive frontier models runs to roughly 100x on input tokens. Anthropic's published rates put Claude Haiku 4.5 at $1 per million input tokens and Opus 4.8 at $5, with output tokens at $5 and $25 respectively (Anthropic pricing). Once a gap that large exists, sending an easy request to the expensive model is not a rounding error. It is the bill.

Routing is the cheap leg of the same strategy I make the case for in shipping smaller models. The small model handles the bulk; the router decides when the bulk is not enough. If you are early in the cost work, routing pairs naturally with prompt caching, which attacks the same bill from the input side. Want a team to build the routing layer rather than bolt one on later? That is what the Devlyn engineering team does.

The routing strategies that matter, and how each decides

There are four routing strategies you will actually see in production, and they differ mainly in how the decision gets made and when. Understanding the difference is the difference between a router you can reason about and one that surprises you in a board meeting.

Rules. The simplest router is a set of hand-written conditions: requests over a token threshold go to the big model, requests matching a known cheap pattern go to the small one. Rules cost essentially nothing to run, under a millisecond, and you can read them. They are crude, they miss nuance, and they are the right place to start because they capture a surprising amount of the savings before you have built anything fancy.

Cascade. A cascade runs the cheap model first, checks the output against a confidence or verification signal, and escalates to a bigger model only when the cheap one fails the check. Most requests never escalate, so you pay frontier prices only on the tail. The academic survey on dynamic routing and cascading frames this as the sequential-escalation pattern, distinct from upfront routing (arXiv survey). The catch: a cascade that escalates often pays for two calls on the same request.

Classifier or predictive router. Here a lightweight model looks at the incoming request and predicts which model should handle it, before any generation happens. The RouteLLM work trained such routers on preference data and reported that a matrix-factorization router hit 95% of GPT-4 quality while sending only 14% of queries to the strong model (LMSYS). This is the most powerful approach and the most work to maintain.

Semantic router. A semantic router embeds the request and matches it against clusters of known query types, routing by meaning rather than rules. It sits between hand rules and a trained classifier on both cost and capability. Embedding-based routing adds roughly 5ms; a heavier ML classifier adds 50 to 100ms, against typical LLM response times of 500 to 2000ms, so even the expensive routers are a single-digit percentage of the total call (DigitalApplied).

A routing-strategy table you can paste into a deck

Here is the same set of strategies side by side: how each one decides, the kind of savings it tends to deliver, and what it puts at risk. Sourced figures are marked; the rest are illustrative ranges from typical deployments.

Strategy	How it decides	Typical savings	Main risk
Rules	Hand-written conditions on length, pattern, or metadata	20-40% (illustrative)	Misses nuance; brittle as traffic shifts
Cascade	Cheap model first; escalate on confidence or verification failure	40-85% (RouteLLM: ~85% on MT-bench)	Double-billing when escalation is frequent
Classifier / predictive	Trained model picks the target before generating	45-85% (RouteLLM: 45% on MMLU)	Drift; needs labeled data and retraining
Semantic	Embed request, match to known query clusters	30-60% (illustrative)	Mis-clusters novel or ambiguous requests

The RouteLLM numbers are real and worth internalizing: roughly 85% cost reduction on MT-bench, 45% on MMLU, and 35% on GSM8K versus a frontier-only baseline, all while retaining about 95% of GPT-4 quality (LMSYS). The variation across benchmarks is the real lesson: routing savings depend entirely on how easy your actual traffic is, so your number is your own, not the paper's.

Build or buy your router

The market is full of managed LLM router products and gateways that promise to do this for you, and they are not wrong to. The question is whether the convenience is worth giving up the thing you most need, which is the ability to understand and debug why a request went where it went.

Build when your routing logic is simple enough to own. A rules layer or a cheap-first cascade with a confidence threshold is a few hundred lines of code and a config file. You can read it, test it, and change it without a vendor relationship. For most teams starting out, this is the right call, because the crude version captures most of the savings and teaches you what your traffic actually looks like.

Buy when routing has become its own discipline. If you are running a trained classifier that needs labeled data, retraining, and monitoring, a managed router that maintains the model and the eval loop can be cheaper than a half-time engineer doing it badly. The honest version of build-vs-buy is a cost-of-ownership question, not a feature comparison. Count the engineer hours, not the API line item.

The pattern I have watched succeed: start with a rule, graduate to a cascade, and only reach for a trained or managed router once you have proof the simpler version is leaving real money on the table. I walk through the full decision space in my book on model routing, including why the model's own confidence is a worse escalation signal than people expect.

A crude rule you can debug at 2am beats an elegant router that routes wrong and never tells you.

The eval problem hiding inside every router

Here is the part most routing guides skip, and it is the part that decides whether routing helps or quietly hurts. Every router decides based on a signal: a confidence score, a classifier's prediction, a verification check. That signal is a claim about quality. If the claim is wrong, routing sends hard requests to the cheap model and the product gets worse while every cost dashboard turns green.

You cannot route on quality you cannot measure. Before a router is trustworthy, you need a frozen, production-sampled eval set that tells you, per model, how often each one is actually right on your traffic. That is the same machinery I describe in my guide to LLM evaluation, and it is not optional for routing. It is the foundation the router stands on.

The specific trap is the cheap path going untested. Teams evaluate the frontier model carefully, ship the router, and never run the same eval on the cheap model the router now sends most traffic to. The result is a system optimized for a cost number nobody connected back to a quality number. Pick the right harness with help from my rundown of LLM evaluation tools, then run every model in the menu against the same frozen set.

A useful reframe: routing is not a cost feature with an eval attached. It is an eval system that happens to save money. Get the eval right and the savings are safe. Get it wrong and you are gambling with the product's quality to shave a bill.

Where routing quietly goes wrong

Routing failures are rarely loud. The system keeps responding, the cost line keeps dropping, and the damage shows up in churn and support volume weeks later. These are the failure modes I watch for.

The confidence signal lies. A small model is often most confident exactly when it is wrong, so escalating on low confidence can leave the worst answers unescalated. Test the signal, do not trust it.
Double-billing on escalation. A cascade that escalates 40% of the time pays for two calls on those requests. If escalation is common, an upfront router that picks once can be cheaper than a cascade that picks twice.
Drift. A classifier trained on last quarter's traffic routes this quarter's traffic worse every week. Without monitoring, you find out from the quality numbers, late.
Routing overhead ignored or overstated. A trained classifier adds 50 to 100ms; on a p95-sensitive flow that can matter, and on a batch job it is noise. Know which one you are running before you optimize the wrong thing.

None of these are reasons not to route. They are reasons to instrument the routed pipeline so the failures are visible the day they start, not the month the revenue dips. Watching a routed system in production is squarely an AI observability and monitoring problem, not a one-time setup.

Two short stories, with numbers

The numbers below are illustrative, drawn from the shape of real deployments rather than any specific client system, and they are NDA-safe.

The cascade that paid for itself in a week. A team running a support assistant entirely on a frontier model was spending roughly $18,000 a month at about 600,000 calls. They added a cheap-first cascade: the small model answered, a confidence check decided whether to escalate. About 78% of requests resolved on the small model. The bill fell to near $5,000, and the held-out eval showed no measurable quality drop on the routed traffic. The router was a config file and a threshold.

The classifier that quietly broke. A different team shipped a trained classifier router, saw a 60% cost drop, and celebrated. Three weeks later, support tickets climbed. The classifier had been trained before a product launch changed the traffic mix, and it was now sending a new category of hard questions to the cheap model with high confidence. Nobody had re-run the eval after launch. The fix was not a better router. It was a frozen eval set that ran on every model weekly, which would have caught the drift in days instead of weeks.

The operator's frame

I will close where I started, on the revenue. Every AI feature has a cost curve and a value curve, and your job is to widen the gap between them. Routing widens it on the cost side without touching the value side, as long as the eval holds. That is rare and worth doing well.

The mistake is treating routing as a feature you turn on. It is infrastructure you operate. The router, the eval set behind it, and the monitoring on top of it are a system, and the system is what produces durable margin. Plugging in a managed router and walking away gets you the demo. Owning the eval loop gets you the margin that survives a traffic shift.

The teams that internalize this build a real cost advantage that compounds. Routing is one lever; it sits alongside smaller models, caching, and the rest of the toolkit in my guide to LLM inference cost. And routing inside an agent loop, where a single task fans out into many model calls, is where the savings get largest and the eval problem gets hardest, which I get into in my piece on the best AI agents.

Frequently asked questions

What is LLM model routing?

LLM model routing is a decision layer in front of a pool of models that sends each request to the cheapest model capable of answering it acceptably, escalating to a more expensive model only when needed. It cuts cost because most production traffic is easy and the price gap between cheap and frontier models is roughly 100x per token, so you stop overpaying on the bulk of requests.

How much does model routing actually save?

It depends on how easy your traffic is, but the documented range is large. The RouteLLM work reported about 85% cost reduction on one benchmark while keeping 95% of GPT-4 quality, using the strong model on only 14% of queries. Production deployments commonly land in the 40 to 85% range. Your number is your own, because it is set by your traffic mix, not the paper's.

What is the difference between a cascade and a router?

A cascade runs the cheap model first and escalates to a bigger model only when the cheap one fails a confidence or verification check, so it can pay for two calls on hard requests. An upfront router classifies the request and picks one model before generating, paying for a single call but needing a trained classifier and labeled data. Cascades are simpler to start with; upfront routers can be cheaper when escalation would otherwise be frequent.

What is the biggest risk with model routing?

Routing on a quality signal you have not measured. If the confidence score or classifier is wrong, the router sends hard requests to the cheap model and the product degrades while the cost dashboard looks healthy. The defense is a frozen, production-sampled eval set run against every model in the menu, plus monitoring to catch drift before the revenue numbers do.

If you want a team to build the routing layer, the eval set behind it, and the monitoring on top, that is exactly what Devlyn's engineering team works on. The cheapest model that can do the job is a great strategy. The eval that proves it can is what makes the strategy safe.

LLM Quantization: When 4-Bit Pays (and When It Bites)

Alpesh Nakrani — Sun, 10 May 2026 18:30:00 GMT

LLM quantization stores a model at fewer bits per weight, cutting memory and cost. The trade-off: quality holds on most tasks and quietly breaks on a few.

LLM quantization is the practice of storing a model's weights (and sometimes its activations) at lower numerical precision, dropping from 16-bit floats down to 8-bit or 4-bit integers, so the model takes less memory and runs cheaper. The cost is quality: most of the time the drop is small enough to hide inside the noise of your eval suite, and some of the time it is large enough to break the feature in production. The whole game is knowing which task you have.

I run revenue at Devlyn, where we ship customer-facing AI into a retail eyewear experience that touches real people in stores. I came up as an engineer, and I still read the inference bill line by line. Quantization is one of the few levers that moves the bill by a multiple rather than a few percent, which is exactly why it deserves more rigor than it usually gets. Most teams either refuse to quantize out of superstition or quantize everything and ship a quietly degraded model. Both are expensive mistakes.

This is a supporting piece in my guide to LLM inference cost. Quantization sits next to model sizing and prompt caching as one of the three biggest cost levers you control, and it is the one teams understand the least.

Key takeaways

Quantization is a margin lever, not a quality compromise, as long as you quantize the tasks where the quality drop falls inside your eval noise and leave the ones where it does not.
8-bit is close to free; 4-bit is a real trade. INT8 and FP8 are near-lossless on most tasks. At 4-bit you save 4x the memory and pay a real, measurable quality cost on hard tasks.
4-bit is fine for summarization, classification, and extraction, and genuinely costly for math, multi-step reasoning, and long-context faithfulness. The bit-width is not a single decision across your whole product.
The kernel matters as much as the format. The same 4-bit weights can run 10x faster or slower depending on which inference kernel serves them.
Quantization is what makes self-hosting pay. Fitting a 70B model on one GPU drops the break-even point against an API from "never" to "sooner than you think."

What LLM quantization actually is, and what you trade

A model weight is just a number. Train a model in the usual way and each of those numbers is a 16-bit float, which is the default precision most modern LLMs ship in. Quantization replaces those 16-bit numbers with lower-precision ones: 8-bit integers, 4-bit integers, or 8-bit floats. Fewer bits per number means a smaller model in memory and less data to move per token, and moving data is most of what inference actually spends time on.

There are two things you can quantize, and the distinction drives everything downstream. Weight-only quantization compresses the stored weights but runs the math in higher precision (often written W4A16, meaning 4-bit weights and 16-bit activations). Weight-and-activation quantization compresses both (W8A8 is the common one). Weight-only is easier to do without hurting quality, because activations carry outliers that low precision handles badly. That is why most 4-bit deployments are weight-only.

The other split is when you quantize. Post-training quantization (PTQ) takes a finished model and compresses it using a small calibration dataset, in minutes to hours, with no retraining. Quantization-aware training (QAT) bakes the low precision into the training loop, which recovers more quality but costs a full training run. For almost every team I talk to, PTQ is the right starting point, and the methods below are all PTQ.

The trade you are making is precision for size. Less precision means each weight is a coarser approximation of what training produced. On easy, redundant tasks that coarseness washes out. On tasks that depend on a long chain of exact intermediate steps, the errors compound. The skill is not "should I quantize," it is "which precision survives this specific task," and that is an empirical question you answer by measuring, not by reading a blog post, including this one.

The methods that matter: GPTQ, AWQ, GGUF, and FP8

Four names cover most of what you will actually deploy. They are not interchangeable, and the differences are practical rather than academic.

GPTQ is a one-shot post-training method that uses approximate second-order information to decide how to round each weight while minimizing the error it introduces. The original work showed it could quantize a 175-billion-parameter model down to 3 or 4 bits in roughly four GPU hours with negligible accuracy loss relative to the full-precision baseline (Frantar et al., ICLR 2023). It is fast to produce and well supported. It tends to lose a little more on code and reasoning than the alternatives, for reasons I will get to.

AWQ (Activation-aware Weight Quantization) starts from a sharp observation: not all weights matter equally. Roughly 1% of weight channels are salient, identified by looking at the activation distribution rather than the weights themselves, and protecting just those channels before quantizing the rest sharply reduces error. It uses no backpropagation and no reconstruction, so it preserves the model's general behavior instead of overfitting to the calibration set (Lin et al., MLSys 2024, which won that conference's best paper award). In practice AWQ holds quality at 4-bit a bit better than GPTQ.

GGUF is the format from the llama.cpp project, and it is what you reach for when you are serving on CPU, on mixed hardware, or on Apple Silicon, where it is effectively the only practical option. Its K-quant variants (Q4_K_M and friends) give you fine-grained control over the quality-size trade by storing more scale data per block. If your deployment is a laptop, an edge device, or a heterogeneous fleet, GGUF is usually the answer.

FP8 is the newer entrant and a different idea. Instead of integers it uses 8-bit floating point (the E4M3 and E5M2 formats), which keeps a wider dynamic range than INT8 and so handles the outlier activations that wreck integer quantization. On the latest hardware FP8 is close to lossless for both weights and activations, which makes it the cleanest way to get an 8-bit speedup without an eval babysitting it. If your GPUs support it natively, FP8 is often the easiest win in this whole list.

Quality by bit-width: where 4-bit is free and where it bites

Here is the part teams get wrong: they treat bit-width as one decision for the whole product, when it is really a decision per task. The same 4-bit model that is indistinguishable from full precision on one job is visibly worse on another.

At 8-bit the trade barely exists. FP8 is essentially lossless, and INT8 with sensible tuning lands within about 1 to 3% of the full-precision model on most tasks. If you are nervous about quantization and want the safe first step, quantize to 8-bit and move on. You get roughly half the memory for a quality cost you will struggle to measure.

At 4-bit the trade is real but narrow. Across the common 4-bit formats, perplexity stays within about 6% of the 16-bit baseline, which sounds reassuring until you look at task-specific numbers. On HumanEval, a code-generation benchmark, one widely-cited comparison put full precision at 56.1% pass@1, with AWQ and GGUF Q4_K_M both holding 51.8% and GPTQ falling to 46% (The AI Engineer, 2026). A four-to-ten-point drop on code is not noise. On a summarizer you would never see it.

That pattern is the rule, not the exception. Four-bit quantization is close to free on summarization, classification, extraction, and retrieval-grounded answering, where the model has slack and the task tolerates approximation. It bites on math, multi-step reasoning, code generation, and long-context faithfulness, where each step depends on the last and a coarse weight throws the chain off. So the operative question is never "is 4-bit good enough," it is "is 4-bit good enough for this task," and the answer changes across the surfaces of a single product.

Bit-width is not one decision for your whole product. It is a decision per task, and the same 4-bit model can be flawless on one job and visibly broken on the next.

At Devlyn this plays out concretely. Our intent classifier and our preference-extraction step run quantized at 4-bit and we cannot find the quality difference in our evals. The step that reasons over a prescription and a set of frame constraints stays at higher precision, because when we quantized it as an experiment we lost a few points on the numeric sub-eval and that step has to be right in front of a customer. Same product, two different precisions, one eval suite that told us where the line was. Those numbers are illustrative of the shape, not a published benchmark, but the decision process is exactly what I would defend in a review.

The memory and throughput math

The memory story is the easy one. Going from 16-bit to 4-bit cuts model size by roughly 4x; 8-bit cuts it by about 2x. A 70-billion-parameter model that needs well over 100GB in full precision compresses to a footprint that fits on a single 80GB GPU at 4-bit. That single fact, "it fits on one card," is what turns a lot of self-hosting math from impossible to routine, because you stop paying for multi-GPU sharding and the interconnect headaches that come with it.

Throughput is more interesting and more misunderstood. Quantization speeds up inference because you move less data per token, but how much speedup you get depends heavily on the inference kernel, not just the format. In one H200 benchmark, the exact same AWQ weights served 68 tokens per second under a default vLLM path and 741 tokens per second under an optimized Marlin kernel, a roughly 10x difference from kernel choice alone, with the optimized path running about 1.6x faster than the FP16 baseline (The AI Engineer, 2026).

The lesson there is one I have learned the expensive way: if you quantize, benchmark, and conclude "quantization didn't speed anything up," you probably benchmarked the kernel, not the format. The format determines the ceiling. The serving stack determines whether you reach it. Most disappointing quantization results are serving problems wearing a quantization costume.

When self-hosting a quantized model beats an API

This is the section that actually moves money, so I want to be honest about both sides of it. The reason quantization matters to a CRO is that it changes the break-even point between paying a hosted API per token and running your own model on your own GPUs.

The arithmetic is roughly this. A dedicated H100 runs about $2 to $2.80 per hour on demand, which is $1,800 to $2,000 a month if you keep it busy around the clock. Quantization lets a strong open model fit on that one card. Against frontier API pricing, self-hosting tends to break even somewhere in the low millions of tokens per day; one widely-cited comparison had a 4-bit Llama 70B setup costing around $4,360 a month at 500 million tokens a day versus roughly $22,500 on an API, about a 5x swing in self-hosting's favor at that volume (sourced from current self-host-versus-API cost analyses, 2026). Below that volume, the API usually wins.

Now the honest counterweight, because the per-token math is a trap. Self-hosting is not just the GPU bill. Realistic estimates put ongoing maintenance, monitoring, and incident response at 10 to 20 engineering hours a month, which at senior rates is another $750 to $3,000, and the all-in cost of a self-hosted deployment commonly runs 3 to 5x the raw GPU price once you count it all. A break-even calculation that ignores the engineer is a break-even calculation that will be wrong in the direction that hurts.

The per-token math says self-hosting wins at volume. The all-in math, including the engineer who babysits it, moves the break-even point and is the only version worth presenting to a board.

So the real rule is: self-hosting a quantized model pays when you have sustained high volume, a latency or data-residency requirement that an API cannot meet, and the engineering capacity to operate it. If you are below a few million tokens a day, or you cannot staff the operations, the API is cheaper even though the per-token number looks worse. This is the same outcomes-over-velocity discipline I apply to model sizing in the case for shipping smaller models: the cheapest token is the one you serve on infrastructure you can actually run.

If you want this break-even modeled against your real traffic and your real quality bar rather than a blog post's averages, that is the kind of work the Devlyn engineering team does on inference economics.

A comparison you can paste into a deck

Here is the practical decision in one table: the method, the bit-width it is usually run at, the quality you typically keep, and where it fits. Quality figures are directional, drawn from the public benchmarks cited above, not a guarantee for your model.

Method	Typical bit-width	Quality retention	Best use case
FP8 (E4M3)	8-bit float	Near-lossless	The safe default on FP8-capable GPUs
INT8 (W8A8)	8-bit int	~1-3% drop	Conservative speedup, broad task safety
AWQ	4-bit, weight-only	~99% on most tasks	4-bit when you want to protect quality
GPTQ	4-bit, weight-only	Within ~6% perplexity; weaker on code	Fast to produce, GPU serving at scale
GGUF (Q4_K_M)	4-bit, weight-only	~92% of perplexity	CPU, edge, mixed fleets, Apple Silicon

If you take one thing from the table: start at 8-bit when in doubt, reach for AWQ when you need 4-bit on a GPU and care about quality, and use GGUF when your hardware is not a data-center NVIDIA card. Everything else is tuning.

The pitfalls nobody puts on the slide

Quantization fails in quiet, specific ways, and the failures rarely show up in the headline accuracy number. These are the ones that have cost me time.

Calibration-set mismatch. PTQ methods tune the quantization using a small calibration dataset. If that data does not resemble your production traffic, the model is optimized for the wrong distribution and degrades on exactly the inputs you care about. Calibrate on your data, not on whatever sample shipped with the tool.

The KV cache is a separate decision. Quantizing the weights does not quantize the key-value cache, which is its own large and growing chunk of memory at long context. Teams quantize weights, celebrate the memory win, then hit an out-of-memory wall on long conversations because the cache was never touched. Treat KV-cache quantization as its own line item with its own quality check.

You must evaluate the quantized model, not the original. The most common mistake I see is running the eval suite on the full-precision model, approving it, and then deploying the quantized one. You validated a model you are not shipping. Quantization is a model change, so it goes through the same gate as any other model change, on a frozen, production-sampled set. If you do not have that gate yet, build it first; my guide to LLM evaluation covers how, and it is the prerequisite for quantizing safely.

Kernel and hardware lock-in. The fast path for a given quantized format often depends on a specific kernel and a specific GPU generation. A format that flies on an H200 may crawl on older hardware or in a different serving stack. Pin down the format-plus-kernel-plus-hardware combination before you commit, because the quality and the speed both depend on all three.

None of these are reasons not to quantize. They are reasons to quantize with a measurement loop attached. Quantization without evals is just hoping the trade went your way, and hope is not a cost strategy. If you want the deeper treatment of sizing the model down before you compress it, the companion read is my book on small models, and the question of whether to fine-tune the quantized model or leave it alone is its own decision I work through in To Fine-Tune or Not.

Frequently asked questions

What is LLM quantization in simple terms? It is storing a model's numbers at lower precision, dropping from 16-bit floats to 8-bit or 4-bit, so the model uses less memory and runs cheaper. You trade a small amount of accuracy for a large reduction in size and cost. On most tasks the accuracy you lose is too small to matter; on math and reasoning tasks it can be large enough to matter a lot.

Does 4-bit quantization hurt model quality? Sometimes, and it depends entirely on the task. On summarization, classification, and extraction the quality drop is usually inside your eval noise. On code, multi-step reasoning, and long-context faithfulness it is measurable, often a few points on the relevant benchmark. The only honest answer comes from running your own eval suite on the quantized model.

Which is better, GPTQ or AWQ? For 4-bit GPU serving where you care about quality, AWQ usually holds accuracy slightly better because it protects the small set of salient weight channels that matter most. GPTQ is fast to produce and serves well at scale but tends to lose a bit more on code and reasoning. If you are on CPU or Apple Silicon, neither applies and you want GGUF.

When is self-hosting a quantized model cheaper than an API? When you have sustained high volume (commonly in the low millions of tokens per day against frontier pricing), plus a latency or data-residency need an API cannot meet, plus the engineering capacity to operate the stack. Below that volume, or without the operations capacity, the per-token API price usually wins once you count the real cost of running your own infrastructure.

Quantization is one of the clearest examples of the principle that runs through all of my inference-cost work: the model is an input to a system you design, not the system itself. Measure the trade, quantize the tasks that survive it, and keep full precision where the chain has to hold. If you would rather have a team instrument the evals and model the break-even against your own traffic, that is what Devlyn's observability and monitoring work is built for. Compress what you can measure. Leave alone what you cannot.

Semantic Caching for LLMs: When It Saves Money

Alpesh Nakrani — Sat, 09 May 2026 18:30:00 GMT

Semantic caching reuses a stored LLM answer when a new question means roughly the same thing as one you have already answered, even if the wording is completely different. It saves money when your traffic is full of paraphrases of a small set of questions, because a cache hit returns in milliseconds at near-zero cost instead of paying for a fresh model call. The part most people get wrong is that it is not the same as exact prompt caching, which only fires when the prefix of the prompt is byte-for-byte identical: exact caching matches strings, while semantic caching matches meaning, and matching meaning is exactly where the money and the danger both live.

I have sat in the billing review where a support assistant was answering the same forty questions ten thousand times a day, each phrased a little differently, and we were paying full price for every single one because nothing matched exactly. Semantic caching is the lever for that shape of traffic. It is also the lever I have watched quietly serve a customer the wrong answer because someone set the similarity threshold by feel. This piece is the deep dive on that lever from my guide to reducing LLM inference cost, and it sits right next to the exact-match version in my piece on prompt caching, which the two are constantly confused with because they solve different problems.

Exact caching matches strings. Semantic caching matches meaning. The first is safe and narrow. The second is powerful and will lie to you if you let it.

Key takeaways

If you read nothing else, these are the load-bearing claims:

Semantic caching reuses answers for similar questions, not identical ones. It catches the paraphrases that exact prompt caching misses, which is where the volume hides in support and FAQ traffic.
The similarity threshold is a revenue dial, not a config value. Too loose and you serve confidently wrong answers; too tight and the hit rate collapses and you save nothing.
A false cache hit is worse than a cache miss. A miss costs you one model call. A false hit costs you a wrong answer to a real customer, and you will not see it in the cost dashboard.
It only pays off on repetitive, paraphrase-heavy traffic. On a workload where every prompt is genuinely novel, semantic caching adds latency and embedding cost and returns nothing.

What semantic caching actually is

Semantic caching is a layer that sits in front of your model and asks one question before every call: have I already answered something that means this? If the answer is yes with enough confidence, it returns the stored response and never touches the model. If the answer is no, it calls the model as usual and saves the new answer for next time.

The word doing all the work is "means." A traditional cache keyed on the raw text would treat "how do I reset my password" and "I forgot my password, how do I change it" as two completely different requests, because the strings differ. A semantic cache treats them as the same request, because they carry the same intent. That is the whole pitch: you stop paying to re-answer questions you have already answered in slightly different words.

If your business runs on retrieval, this should sound familiar, because it is the same embedding machinery you already use for search. The cache is just pointing that machinery at your own past answers instead of your document corpus. I walk through the similarity foundations underneath all of this in my book on embeddings in production, and the retrieval discipline it depends on in the RAG book.

How embedding-similarity caching works under the hood

The mechanics are simpler than the name suggests. When a query arrives, the cache turns it into a vector embedding, a list of numbers that captures its meaning. It then searches the vectors of past queries for the closest match, usually with an approximate nearest-neighbor index like HNSW so the lookup stays fast even with millions of entries.

The closeness of the match is scored with cosine similarity, a number between 0 and 1 where 1 means identical meaning. The cache compares that score against a threshold you set: if the best match clears the threshold, the cache returns the stored answer, and if nothing clears it, you get a miss and the request goes to the model. Each stored answer also carries a TTL so stale facts expire instead of haunting you forever.

That single threshold is the entire risk surface of the system, and it is worth understanding before you ship anything. Redis, whose semantic cache is one of the more documented production implementations, recommends thresholds somewhere between 0.7 and 0.95, and tells you to start conservative at 0.9 or higher (Redis). The reason for that caution is the next section, and it is the part of semantic caching that has cost teams more than it saved them.

If you are early in building an AI product and want this kind of infrastructure designed correctly the first time rather than discovered the hard way, the Devlyn team builds exactly this.

The tools: GPTCache, Redis LangCache, and gateway caches

You rarely build a semantic cache from scratch, and you should not. The open-source default is GPTCache, a Python library from Zilliz that handles the embed-search-store loop for you, with pluggable embedding models and a choice of vector backends like Milvus, Faiss, Redis, and Qdrant. It is the thing most people mean when they say "semantic cache" in a codebase.

On the managed side, Redis ships a semantic cache (marketed as LangCache) that gives you the same loop with native vector search, TTL expiration, and HNSW indexing built in. Beyond those, a growing number of LLM gateways now bake semantic caching in as a config flag, so you get it without writing cache code at all. The tradeoff there is the usual one: less control over the threshold and the eviction policy in exchange for not maintaining it yourself.

The tool matters less than the discipline around it. A gateway flag with a default threshold and no measurement is how teams end up with the failure mode I am about to describe. Pick whichever fits your stack, then spend your real effort on the threshold and the guardrails, not the library.

Tuning the similarity threshold and the false-hit danger

The threshold is a precision-recall dial, and you cannot win both ends. Lower it and you catch more paraphrases, so your hit rate climbs and your bill drops, but you also start matching questions that are merely adjacent rather than equivalent. Raise it and every hit is trustworthy, but you miss the loosely-worded paraphrases and the savings evaporate. There is no setting that is loose and safe at the same time.

Here is the failure that should scare you. A team I worked with set a semantic cache at 0.83 because that was the number in a tutorial, and the hit rate looked great. Then a customer asked whether a product was covered under warranty after twelve months, and the cache served them the confident answer to a question about coverage after twelve days, because the two queries sat 0.84 apart in embedding space. The numbers are illustrative, but the shape is real: a wrong answer, delivered fast, with full confidence, invisible in every cost metric because it counted as a successful cache hit.

A cache miss costs you one model call. A false cache hit costs you a wrong answer to a real customer, and it will never show up in the cost dashboard. That asymmetry is the whole game.

The discipline that prevents this is boring and it works. Collect a few hundred real queries, label which ones should and should not share an answer, and measure precision at each candidate threshold until you clear about 95% precision before you ship (Redis). Add a confidence buffer so you only serve a hit when it beats the threshold by a comfortable margin, and fence the cache with hard metadata boundaries so a hit can never cross tenant, locale, or product line no matter how similar the wording looks. Soft similarity inside hard walls is the pattern that survives contact with production.

A comparison of caching approaches

Here is the same decision laid out as a table: how each approach matches, what it risks, and the traffic it is built for.

Approach	How it matches	Main risk	Best for
Exact response cache	Byte-for-byte identical query string	Misses every paraphrase, so hit rate is low on free text	Fixed, repeated queries (status checks, canned commands)
Exact prompt caching	Identical prompt prefix within a cache window	Saves nothing if the prefix is not stable across calls	Long stable system prompts, shared documents, few-shot blocks
Semantic caching	Embedding similarity above a threshold	False hits: serving a wrong answer to a near-miss query	Paraphrase-heavy traffic (support, FAQ, knowledge bases)
Two-layer (exact then semantic)	Exact first, fall back to similarity	More moving parts to monitor and tune	High-volume assistants where both shapes of traffic appear

The two-layer pattern is what most serious deployments converge on. You check for an exact match first, which is free and carries zero false-positive risk, and only run the more expensive and riskier semantic match when the exact layer misses. It gives you the safety of strings on the common path and the reach of meaning on the tail.

When semantic caching pays off, and when it does not

The economics are entirely a function of hit rate, and hit rate is entirely a function of how repetitive your traffic is. The published benchmark for GPT Semantic Cache reports hit rates of roughly 62% to 69% on support and FAQ-style query sets, with positive-hit accuracy above 97% once the threshold is tuned (arXiv). Those are the conditions where the lever is worth pulling: a narrow set of questions asked many different ways.

Translate that into money and it is real. Redis sketches a workload spending $80,000 per quarter on the model, where a 30% to 40% semantic cache hit rate saves $24,000 to $32,000 per quarter (Redis). The latency win is just as concrete: a cached answer can come back in around 0.3 seconds against the 2.7 seconds of a live call, because you skipped the model entirely. Faster and cheaper at the same time is rare, and on the right traffic this delivers it.

Now the honest other side. If every request to your system is genuinely novel, a code-generation tool taking unique repos, an analysis agent reasoning over fresh data each time, your semantic cache hit rate will sit near zero. You will pay for an embedding call and a vector lookup on every request and get almost nothing back. On novel traffic, semantic caching is pure overhead, and the right answer is to spend your cost effort on routing to a smaller model or on how you adapt the model instead.

The deciding question is not "would caching help" in the abstract. It is "what fraction of my real traffic is a paraphrase of something I already answered." Pull a week of production queries, cluster them, and look: if a small number of clusters cover most of your volume, semantic caching is a strong lever, and if your queries are a long flat tail of unique requests, skip it and pull a different one.

Semantic caching vs prompt caching

This is the distinction that derails the most conversations, so let me make it sharp. Exact prompt caching reuses the already-computed prefix of a single prompt, so when a long system prompt or a block of documents repeats across calls, you stop paying full input price to re-send those identical tokens. It matches on bytes, it never serves a wrong answer, and I cover it fully in the prompt caching deep dive.

Semantic caching reuses an entire answer for a different but similar question, and it matches on meaning rather than bytes. That reach is why it can save more on the right traffic, and it is also why it can be wrong in a way prompt caching never can. They are not competitors. A mature inference stack often runs both: prompt caching to cut the cost of the repeated prefix on every call, and semantic caching to skip the call entirely when the question has already been answered.

The clean mental model is a layered defense against spend. Exact response cache catches the literal repeats for free, prompt caching discounts the stable prefix on the calls you do make, and semantic caching skips the call entirely when meaning repeats. Routing then sends whatever is left to the cheapest model that can handle it. None of these replaces the others, and the order you reach for them in is most of the design.

Frequently asked questions

What is semantic caching for LLMs? Semantic caching stores past LLM answers alongside vector embeddings of the questions that produced them. When a new query arrives, it embeds the query, finds the most similar past question by cosine similarity, and returns the stored answer if the match clears a confidence threshold, skipping the model call entirely. It reuses answers for questions that mean the same thing, not just questions worded the same way.

How is semantic caching different from prompt caching? Prompt caching reuses the identical computed prefix of a prompt and matches byte-for-byte, so it never serves a wrong answer and saves money when a large stable chunk repeats across calls. Semantic caching reuses a whole answer for a similar question and matches on meaning, so it reaches further but can serve a wrong answer if the threshold is too loose. Most strong stacks run both.

What similarity threshold should I use for a semantic cache? Start conservative, around 0.9 or higher, then tune against a labeled set of a few hundred real queries until you clear roughly 95% precision before shipping. Add a confidence buffer so you only serve hits that beat the threshold by a margin, and fence the cache with hard metadata boundaries like tenant and locale so a hit can never cross them.

When does semantic caching not save money? When your traffic is genuinely novel and rarely repeats in meaning, the hit rate sits near zero and you pay for an embedding call and a vector lookup on every request for almost no return. On that shape of workload, semantic caching is overhead, and routing or model adaptation is the better cost lever.

If you want this built into your stack with the threshold tuning, false-hit guardrails, and the monitoring to catch a bad hit before a customer does, that is squarely an AI observability and monitoring problem, and it is the work the Devlyn team does on retrieval and knowledge systems. Cache the meaning that repeats. Measure the precision before you trust it. And remember that the cheapest wrong answer still costs more than the model call you were trying to skip.

LLM Token Optimization: Cut Token Cost, Keep Quality

Alpesh Nakrani — Fri, 08 May 2026 18:30:00 GMT

LLM token optimization is the work of reducing token usage on every call so the bill tracks the workload instead of your prompt habits. You cut tokens in two places: the input you send and the output you generate. Start with output, because on most frontier APIs output is priced 5x to 6x higher than input, so the cheapest token you will ever buy is the one you never make the model write (Claude pricing). After that, trim the input nobody reads, compress what you must send, and stop replaying context the model already saw.

I have sat in the billing review where the inference line outran revenue, and the reflex in the room is always to shop for a cheaper model. The model is rarely the problem. The problem is a 6,000-token system prompt nobody has read in months, an agent replaying its full transcript on every step, and a model narrating its reasoning for 800 tokens to answer a yes-or-no question. This piece is the deep dive on cutting that waste, one lever in the larger guide to reducing LLM inference cost, and the lever that pays back fastest because it costs you nothing in infrastructure.

You pay for every wasted token twice: once on the invoice, and once in the latency your user sits through.

Key takeaways

If you read nothing else, these are the load-bearing claims:

Output is the expensive token. Vendors price output 5x to 6x higher than input because each output token is a fresh sequential forward pass, so capping output is the single biggest-payoff cut.
Input trimming is free money. Pruning dead system-prompt text, redundant few-shot examples, and over-retrieved context cuts the bill with near-zero quality cost, because those tokens were never doing work.
Prompt compression pays off at scale, not by default. Tools like LLMLingua hit 2x to 20x compression, but they add a step and a quality risk, so they earn their place only on large, repeated prompts.
Agents leak tokens per task, not per call. A loop that replays its growing transcript turns a cheap per-call price into an expensive per-task one; summarize between steps instead of replaying.
Measure cost per resolved task, not cost per token. A compression that quietly drops quality is not a saving once a human has to finish the job.

Where your tokens actually leak

Before you cut anything, find the leak. Token cost has two faucets, and most teams only watch one. The first is input: every token in the system prompt, the few-shot examples, the retrieved context, and the replayed conversation history, all of which you pay for on every single call. The second is output: every token the model generates, priced several times higher than input.

The asymmetry is the whole game. As of mid-2026, Anthropic prices Claude Opus 4.5 at $5 per million input tokens and $25 per million output, Sonnet at $3 and $15, and Haiku 4.5 at $1 and $5 (Claude pricing). That is a 5x ratio in every line. The reason is mechanical: input tokens get processed in one parallel pass, while each output token requires its own full forward pass through the model, generated one at a time (CodeAnt, token economics).

So the order of attack writes itself. Cut output first, because each token there is worth five of the input kind, then trim input, because there is usually a lot of it and most of it is dead weight. If you are building an AI feature and want a team to instrument this properly rather than guess, that is the kind of work the Devlyn engineering team does day to day. The rest of this piece walks each faucet in turn.

Trim the input: system prompts, few-shot, and retrieval

Input trimming is the unglamorous lever that pays back fastest, because you are removing tokens that earned nothing. Industry write-ups in 2026 estimate that naive full-context and naive RAG pipelines run 3x to 5x higher token cost than the task actually requires (Redis, token optimization). That gap is almost entirely dead weight you can delete without touching quality.

There are three usual suspects. The system prompt grows by accretion: every incident adds a rule, nobody ever removes one, and a year later you are shipping 6,000 tokens of instructions the model half-ignores. Read it line by line and cut anything the model already does without being told. The second suspect is few-shot examples that a better instruction or a fine-tune has made redundant.

The third is over-retrieval. RAG systems default to a top-k that is too generous, stuffing eight chunks into context when two would answer the question. Retrieve less, but retrieve better: a tighter top-k with a reranker beats a wide top-k every time, on both cost and accuracy (Redis). If your retrieval quality is the bottleneck, that is a RAG and knowledge integration problem worth fixing at the source.

One honest trade-off lives here. Few-shot examples and a fat system prompt are often load-bearing on day one, and cutting them blind will tank your accuracy. The discipline is to cut against an eval set, not against a hunch, so you can see the quality move as the tokens drop. Whether to remove few-shot for good usually comes down to the RAG versus fine-tuning decision, since a fine-tune bakes the examples into the weights and lets you delete them from the prompt entirely.

Control the output: the most expensive token is one you never generate

This is where the money is, and it is the lever most teams skip. Because output costs 5x to 6x input, shaving 200 tokens off a response is worth more than shaving 1,000 off the prompt. Yet teams pour effort into the prompt and let the model ramble.

The first move is the bluntest: set max_tokens. It is a hard ceiling, and it stops a model that wants to write an essay from doing it on your invoice. Pair it with a length instruction in the prompt, something as plain as "answer in two sentences," so the model targets brevity instead of getting truncated mid-thought.

The second move is structured output. When you ask for JSON against a schema, the model stops writing prose preambles and apologies and just fills the fields. That alone cuts output, and it makes the response parseable instead of something you have to scrape. There is a small token overhead to the schema itself, but on any volume it pays for itself many times over.

The third move is format choice, and it is underrated. JSON is verbose: all those quotes, braces, and repeated keys are tokens you pay for on every row. For tabular data, a compact format like TSV or TOON can cut the serialized token count by a meaningful margin against the same JSON payload, with no loss of information. The model reads it fine, and your wallet reads it better.

Teams pour effort into the prompt and let the model ramble. Output is where the 5x lives. Cap it first.

Prompt compression: LLMLingua, and when it earns its keep

Prompt compression is the technique people reach for first and need least. The idea is clever: a small model rewrites or prunes your prompt to keep the information the big model needs while dropping the filler. Microsoft's LLMLingua is the reference implementation, a coarse-to-fine method with a budget controller that reaches up to 20x compression with minimal performance loss on its benchmarks (Microsoft LLMLingua).

The follow-up, LLMLingua-2, is task-agnostic and tuned for speed, running 3x to 6x faster than the original at the 2x to 5x compression ratios you would use in production (Microsoft LLMLingua). For long-context RAG specifically, the LongLLMLingua variant reports improved retrieval quality using roughly a quarter of the tokens. The savings are real, and on a large fixed prompt repeated millions of times, they are significant.

Here is the trade-off nobody puts in the headline. Compression adds a step to your pipeline, which means latency and a second model to run and pay for. It also adds a quality risk, because the compressor can drop a token that turns out to matter. The math only works when the prompt is large, repeated, and stable enough that the per-call compression cost is dwarfed by the per-call savings.

For most teams, the cheaper move is to delete the dead tokens by hand first, then cache the stable prefix. Prompt caching reads a cached prefix at 0.1x the input rate, a 90% discount on the cached portion, with no quality risk at all (Claude pricing). Reach for LLMLingua-style compression after you have trimmed and cached and the prompt is still genuinely large.

Manage context over a session: summarize, do not replay

The worst token leaks I have seen do not live in a single call. They live in the loop. A multi-turn chat or an autonomous agent accumulates context, and the lazy implementation replays the entire growing transcript on every step. By turn twenty the model is rereading thousands of tokens it has already seen, and you are paying for every one of them, every step.

The fix is to summarize instead of replay. After each step, compress the transcript into a compact running summary plus the last turn or two verbatim, and pass that forward instead of the raw history. The model keeps the thread; you stop paying input tax on ancient context. This is the core idea behind context management, and it is the difference between an agent loop that scales and one whose per-task cost climbs with every turn.

This is also why the per-call price lies to you on agentic workloads. A single autonomous task can fan out into thirty or fifty model calls, each carrying the conversation, so the per-call number looks trivial. The per-task number is what lands on the invoice, and it can be fifty times a single completion if the context is not managed. Summarizing between steps and capping each step's output is usually where the agent budget gets saved.

If most of your traffic is easy and only the tail is hard, the biggest-payoff move sits alongside this: send the easy majority to a smaller model so you are not paying frontier output rates to generate boilerplate. I make that case in the argument for shipping smaller models, and the routing mechanics live in the guide to LLM model routing. Token cutting and right-sizing the model are the same campaign from two angles. For the deeper theory of what fits in a window and what to evict, my book on context windows walks through the eviction strategies in detail.

A comparison you can paste into a planning doc

Here is the same set of techniques in one table: what each one saves, what it risks, and how much work it is to ship. Token-saving ranges are directional and depend on how bloated your current setup is.

Technique	Token saving	Quality risk	Effort
Cap output (max_tokens + length instruction)	High, on the expensive 5x-6x output side	Low if the cap fits the task; truncation if too tight	Low
Structured output (schema / JSON)	Medium; kills prose preambles	Low; small schema overhead	Low
Compact format (TSV / TOON over JSON)	Medium on tabular payloads	Very low; same information	Low
Trim system prompt and few-shot	Medium; recurs on every call	Medium if cut blind; low if cut against evals	Low
Tighten retrieval (lower top-k + rerank)	Medium to high on RAG	Low; often improves accuracy	Medium
Summarize context (don't replay history)	High in long sessions and agent loops	Medium; summary can drop a detail	Medium
Prompt compression (LLMLingua)	High on large fixed prompts (2x-20x)	Medium; compressor can drop a key token	High

Read the table top to bottom as a rough order of operations. The low-effort, low-risk rows at the top are where you start. Compression sits at the bottom not because it saves the least, but because it costs the most to do right and you should exhaust the free wins first.

Two mini-cases, with the numbers

The numbers below are illustrative, drawn from the shape of real situations rather than any specific live system, so treat them as the order of magnitude, not a quote.

The chatbot that narrated. A support assistant ran on a frontier model with no output cap, and the model habitually opened every answer with a paragraph of throat-clearing before the actual reply. Average output ran near 600 tokens where 150 would do. Adding max_tokens, a "be direct, no preamble" instruction, and a JSON envelope cut average output by roughly two-thirds, with the input untouched. Because output is the 5x token, the bill fell far more than the token count alone suggested.

The agent that replayed itself. An internal research agent averaged around thirty model calls per task, each replaying the full growing transcript as input. The per-call cost looked like rounding error; the per-task cost was the largest single line in the AI budget. Swapping verbatim replay for a running summary plus the last two turns, and capping each step's output, took a large bite out of per-task cost with no change in the final answer quality on the eval set. The fix was in the token budget, not the model.

Measure token cost per task, not per call

Every technique above optimizes tokens. The number you should actually be optimizing is cost per resolved task: total inference spend divided by the tasks the system fully handled without a human stepping in. Cost per token is what the model charges you. Cost per resolved task is what the model costs you, and only one of those shows up on the P&L.

This matters most for compression and aggressive trimming, because both can quietly trade quality for tokens. A cheaper call that fails more often is not cheaper once a human cleans up the mess; the token cost per call dropped while the cost per resolved task went up. The only way to catch that is to track quality and resolution alongside spend, which I cover in the guide to LLM evaluation.

Cost per token is what the model charges you. Cost per resolved task is what the model costs you. Optimize the second one.

Doing this in production, not in a one-off spreadsheet, is an AI observability and monitoring problem. You want token usage, output length, and resolution rate on the same dashboard, so a token cut that hurts quality shows up as a falling resolution rate before a customer finds it for you. The unit economics of all this, why a token saved is a margin point earned, is the subject of my book on pricing intelligence.

Frequently asked questions

What is LLM token optimization? It is the practice of reducing the number of tokens you send to and generate from a language model on each call, so your inference bill tracks the actual work instead of accumulated prompt bloat. The main levers are capping and structuring output, trimming dead input, compressing large fixed prompts, and summarizing context in long sessions instead of replaying it.

How do I reduce token usage the fastest? Cap output with max_tokens and a length instruction, then ask for structured output, because output tokens are priced 5x to 6x higher than input and that is where the payoff is. Next, prune the dead text out of your system prompt and lower your retrieval top-k. Both are low-effort, low-risk, and recur on every call.

Does prompt compression like LLMLingua actually help? Yes, but selectively. LLMLingua reaches high compression ratios with small quality loss on benchmarks, which pays off on large, stable prompts repeated at scale. For most teams the cheaper first move is to delete dead tokens by hand and cache the stable prefix, then reach for compression only if the prompt is still large.

Why are output tokens more expensive than input tokens? Generating output is sequential: the model produces one token at a time, each conditioned on all the ones before, so each output token needs its own full forward pass. Input can be processed in a single parallel batch. That is why vendors price output 5x to 6x higher, and why capping output cuts both cost and latency at once.

If you want the full picture of where inference spend goes and the order to pull every lever, start with the guide to reducing LLM inference cost. And if you would rather have a team trim the tokens, route the traffic, and instrument cost per resolved task in your stack from day one, that is exactly what the Devlyn team builds. Cut the token you never needed. Measure the ones that are left.

Hiring AI Engineers: The Definitive 2026 Guide

Alpesh Nakrani — Thu, 07 May 2026 18:30:00 GMT

AI engineers are the hardest role on the market to fill. Here is what good actually looks like, what it costs, and how the bad hires fail.

Hiring AI engineers comes down to one thing: hire for judgment, not for a list of frameworks on a resume. The engineer you want is the one who can look at a model output and tell you whether it is correct, why it is wrong when it is wrong, and what they would change before it ever touches a customer. Everything else, the model names, the vector database, the orchestration library, is learnable. Judgment under production pressure is the scarce thing, and it is what separates a real AI engineer from someone who has read the docs.

I have hired and deployed more than 80 senior AI engineers at Devlyn and shipped over 200 products on top of them. I sit in two seats at once: I read the traces and I read the P&L. That combination has taught me that most of what gets written about hiring AI engineers is wrong in the same direction. It treats the role as a senior software engineer who also knows machine learning, and it treats the hire as a sourcing problem. It is neither. The market is the tightest it has ever been, the failure modes are specific and expensive, and the difference between a good hire and a bad one does not show up until something is in production with real users.

This is the pillar guide for the whole topic. I will tell you what an AI engineer actually is, the roles you might actually need, the skills that matter versus the ones that just sound good, where to find people and how to vet them, what it costs in-house and outsourced, how these hires fail, and a checklist you can run. The deeper pieces, on skills, cost, interview questions, and the rest, branch off from here.

Hire for judgment, not throughput. The scarce skill is the ability to evaluate model output and own the outcome, not the ability to wire up an API.
The market is structurally short. AI is the single hardest skill to hire for globally, and demand outruns supply by roughly three to one. You are competing for a small pool.
"AI engineer" is not one role. It splits into application, LLM, retrieval, MLOps, agentic, and forward-deployed specialists. Hiring the wrong specialization wastes months.
Most AI projects fail in production, not in the demo. The bad hire ships an un-evaluated model that looks fine in a notebook and breaks in front of customers.
In-house and outsourced both work; the trap is hiring slowly for a role you cannot vet. If you cannot evaluate the candidate yourself, buy the judgment pre-vetted rather than gambling on a three-month search.

What an AI engineer actually is, and what they are not

An AI engineer builds production features on top of models: large language models, retrieval systems, vision and speech models, and the agents that chain them together. The defining word is production. A researcher trains and improves models. A data scientist finds signal in data and reports it. An AI engineer takes a model someone else built and turns it into something a customer can use without a human standing behind it apologizing.

That distinction matters because it changes what you are hiring for. A research background is nice but it does not predict whether someone can ship a streaming chat interface that handles a model timing out gracefully, enforces permission-aware data access, and degrades to a safe answer when retrieval comes back empty. Those are engineering problems with an AI surface, and they are where the actual work lives. I have written the longer version of this in what an AI engineer is, but the short form is: the job is building reliable software where one of the components is probabilistic.

The most common confusion is AI engineer versus machine learning engineer. The honest answer is that the titles overlap and companies use them loosely, but the center of gravity differs. An ML engineer leans toward training, fine-tuning, feature pipelines, and model lifecycle. An AI engineer leans toward composing existing models into a working product. I break this down fully in AI engineer vs ML engineer, because hiring the wrong one for your problem is one of the more expensive mistakes in this whole space.

The job is building reliable software where one of the components is probabilistic. That is harder than it sounds, and it is not what a research resume predicts.

The roles and specializations you might actually need

"AI engineer" is a category, not a job. When a founder tells me they need to hire an AI engineer, my first question is which problem they are solving, because the answer determines which specialist they actually need. Hiring a generalist for a problem that needs a retrieval specialist, or vice versa, costs you months you do not have.

Here is the map I use. Most companies need one or two of these, not all of them, and the right first hire is usually broader than founders expect.

Role	What they do	When you need them
AI application engineer	Builds the full feature: UI, API, model integration, retrieval, production controls	Your first AI hire, or any customer-facing feature
LLM engineer	Prompting, structured outputs, evaluation harnesses, safety controls around a model	The model behavior itself is the hard part
Retrieval engineer	RAG architecture, vector and hybrid search, chunking, retrieval evaluation	Answers must be grounded in your own data
MLOps engineer	CI/CD for models, serving, monitoring, drift detection, rollback, governance	You run real models at scale and need them to stay up
Agentic workflow engineer	Multi-step agents, tool permissions, approval gates, audit logging	An agent must safely take actions in connected systems
Forward-deployed engineer	Embeds with the customer, turns a pilot into a shipped deployment	You sell AI and the gap is integration, not capability

If you are not sure which one to start with, start with an application engineer. They cover the most surface area and will tell you, honestly, when the problem has narrowed enough to warrant a specialist. If you want senior, pre-vetted AI application engineers without the three-month search, that is exactly what we provide at Devlyn's AI application engineer practice: embedded senior engineers with a working proof point inside a week and a replacement guarantee if the fit is wrong.

The skills that actually matter (and the ones that just sound good)

If I could screen on one skill, it would be evaluation. The engineers who ship reliable AI are the ones who instinctively ask "how will we know this is good?" before they write the feature, not after it breaks. They build a held-out set of representative inputs, they categorize failures by type and severity, and they treat a passing eval as the gate for shipping. An engineer who cannot describe how they would evaluate a feature is an engineer who will ship you something that demos well and fails quietly.

After evaluation, the skills that matter are the unglamorous production ones. Retrieval architecture, because grounding answers in your data is the most common production pattern and the place most teams get it subtly wrong. Handling failure modes, timeouts, empty retrievals, malformed model output, because probabilistic systems fail in ways deterministic ones do not. And the judgment to choose between prompting, retrieval, and fine-tuning rather than reaching for the most complex option by reflex.

The skill that sounds good and matters less than people think is pure prompt engineering. Prompting is real and it is part of the job, but it is the entry-level slice of it. A candidate whose entire portfolio is clever prompts has not yet hit the problems that make AI engineering hard: what happens at scale, under cost pressure, when the model is wrong and a customer is watching. I go deeper on the full skill stack in the AI engineer skills guide, but the headline is that frameworks are learnable and judgment is not.

There is also a quieter market reality behind all of this. AI is now the single hardest skill category to hire for in the world. ManpowerGroup's 2026 Talent Shortage Survey found that 72% of employers across 41 countries struggle to fill roles, with AI skills topping the global list of hardest-to-find capabilities for the first time, ahead of traditional engineering. Independent supply-demand analyses put open AI roles at roughly three times the number of qualified candidates. You are not hiring from a deep bench. You are competing for a thin one.

Where to find AI engineers and how to vet them

Sourcing is the easy part and the part everyone over-indexes on. You can find candidates through your network, specialized communities, the usual platforms, contract-to-hire, or a vetted partner. None of those channels solves the actual problem, which is that you cannot tell a good AI engineer from a confident one with a clean resume until you watch them work on something real.

The vetting is where hires are won and lost. The interview that works is not a LeetCode loop and it is not a take-home that rewards polish. It is showing the candidate a piece of model output, ideally a plausible-looking wrong one, and asking what is wrong with it, what they would change, and what they would need to know before shipping it. People with judgment dig into the failure. People trained for throughput immediately pivot to how they would produce something better, which tells you they cannot yet see the gap between a plausible answer and a correct one.

The seniority question is real here. A senior AI engineer who can read model output and know immediately whether it is correct is worth several juniors whose output you have to check line by line, because in this domain the cost of a missed error is paid in production with customers watching. That is the case I make in senior vs junior AI engineers, and it is why my hiring posture is senior-first. For the specific questions that surface judgment in an interview, I have a full set in AI engineer interview questions.

If you cannot run that interview well yourself, that is not a small problem. It means you are about to spend three months and real money selecting on signals you cannot read. In that situation, buying the judgment pre-vetted is usually the better trade than gambling on your own ability to evaluate a skill you do not yet have in-house.

Cost and engagement models: in-house versus outsourced

Let me give you the numbers, because cost is where this decision usually gets made and where the most wishful thinking happens. In the United States, AI engineer base compensation at mainstream employers runs roughly $134K starting, $171K at the midpoint, and $193K at the high end, according to the Robert Half 2026 Salary Guide. Total compensation including equity runs higher: Levels.fyi data puts the median AI engineer around $151K and median machine learning engineer compensation well above $250K once stock and bonus are included. At frontier labs the numbers detach from reality entirely, with packages routinely clearing $800K, but that is a different market than the one most companies hire in.

Offshore changes the arithmetic substantially. Senior AI engineers run roughly $20–50 per hour in India and $35–70 per hour in Eastern Europe, with a 20–50% premium over baseline development rates for the AI specialization, per offshore-rate aggregators tracking 2026 pricing. The headline savings look like 40–70% versus a US hire, though after management overhead, time-zone friction, and rework, realistic net savings land closer to 30–60%. I break down the full math, including the costs nobody quotes you, in the AI engineer cost guide.

The engagement model matters as much as the rate. A full-time in-house hire makes sense when AI is core to your product and you need the knowledge to compound internally. An embedded senior engineer makes sense when you need senior judgment now without a year of recruiting. A fixed-scope sprint makes sense when you have one feature to ship and want a proof point before you commit. At Devlyn an embedded senior AI engineer runs around $5K per month, which against a US fully-loaded cost north of $200K per year is the comparison most founders are actually weighing. The longer in-house-versus-outsourced tradeoff, including when each genuinely wins, is in in-house vs outsourced AI.

The real cost trap is none of these line items. It is hiring slowly for a role you cannot vet and ending up with someone who looks the part and ships an un-evaluated model. That hire is more expensive than any rate card, and it is the failure mode I see most.

The most expensive AI hire is not the highest-paid one. It is the one you could not vet, who shipped something that demoed well and broke in production.

How AI hires fail, with the numbers behind it

The failure rates in this field are genuinely alarming and they are the backdrop to every hiring decision. MIT's Project NANDA reported in 2025 that 95% of organizations deploying generative AI saw zero measurable P&L impact. Gartner has projected that through 2026 organizations will abandon 60% of AI projects that are not supported by AI-ready data, and that the large majority of AI pilots never reach production at all. These are not edge cases. This is the base rate, and a bad hire pushes you straight into it.

The first failure mode is the un-evaluated model. An engineer builds a feature, it works in the notebook, they ship it. There is no held-out set, no failure taxonomy, no monitoring. It works for three weeks and then a customer hits an input the engineer never imagined and the model confidently returns something wrong, in front of a real person, with no safety net. Picture a support-deflection bot that hit 94% accuracy in testing and then, the week after launch, started confidently inventing a refund policy that did not exist. The model was never the problem. The absence of evaluation was. This is illustrative, not a specific account, but I have watched versions of it more than once.

The second failure mode is the demo that was never a product. The engineer is brilliant at the impressive first version and has no instinct for the unglamorous 90% that makes something shippable: the rate limits, the timeouts, the permission checks, the graceful degradation. Imagine a team that built a slick agent demo in two weeks, showed it to the board, and then spent four months discovering it could not safely touch a production database. The demo was real. The product was four months away and nobody had priced that in.

The third failure mode is the resume-keyword hire. Someone lists every framework, interviews well on vocabulary, and turns out to have wired up tutorials without ever owning a system in production. Contrast that with the hire who gets it right: a team I think of brought in one senior engineer specifically for judgment, not framework breadth, and that engineer's first move was to build an eval set before writing a feature. The product shipped, and it held up, because someone in the room could tell good output from plausible output. That is the whole game.

A hiring checklist you can actually run

Here is the checklist I would hand a founder making their first AI hire, distilled from the failures above. Run it in order.

Define the problem before the role. Name the specific feature or capability you need shipped, then pick the specialization from the roles table. Do not hire "an AI engineer" in the abstract.
Screen on evaluation first. Ask how they would know the feature is good. If they cannot answer concretely, stop there.
Interview for judgment, not production speed. Show them flawed model output and ask what is wrong with it. Watch whether they analyze or pivot to producing.
Check for a production scar. Ask about a time their AI system failed with real users and what they changed. People who have shipped have these stories. People who have not, do not.
Bias senior for the first hire. The first AI hire sets the floor for everything after it. Pay for judgment you can trust unsupervised.
Pressure-test the cost honestly. Compare fully-loaded in-house cost against an embedded or vetted option, including the cost of a three-month search and a possible bad hire.
Buy vetting you do not have. If you cannot evaluate the skill yourself, do not gamble on a cold hire. Use a partner who can.

For the question of timing, when in a company's life the first AI hire actually pays off, I have a dedicated piece on when to hire an AI engineer. And once you are past the first hire and building out a function, building an AI team covers the structure and ratios that hold up as you scale.

The deeper framework underneath all of this, why judgment became the scarce input and what that does to how you staff, is the thing I have spent the most time on. I wrote it up at length in what a team is for after the machine does the work, and the full playbook is in the book Building an AI-Native Team: Hiring for judgment, not throughput. If you read one thing after this guide, read that, because hiring AI engineers well is downstream of getting the hiring philosophy right.

Frequently asked questions

How do you hire an AI engineer?

Define the specific problem you need shipped, choose the specialization that fits it, then vet for judgment rather than framework breadth. The interview that works shows the candidate flawed model output and asks what is wrong with it and what they would change before shipping. Screen on evaluation first: an engineer who cannot describe how they would measure whether a feature is good will ship you something that demos well and fails quietly in production.

What does it cost to hire an AI engineer in 2026?

In the United States, base compensation runs roughly $134K to $193K at mainstream employers, with total compensation including equity often above $200K, and frontier-lab packages far higher. Offshore senior AI engineers run roughly $20–70 per hour depending on region, and embedded engagement models land around $5K per month. The cost that actually matters, though, is the cost of a bad hire who ships an un-evaluated model, which dwarfs any rate-card difference.

Why do so many AI hires and projects fail?

The base rate is brutal: MIT reported 95% of generative AI deployments showed no measurable P&L impact, and Gartner projects most pilots never reach production. The common thread is not model quality. It is the absence of evaluation and production discipline, the un-evaluated model shipped from a notebook, the demo mistaken for a product, the resume-keyword hire who never owned a live system.

Should I hire an AI engineer in-house or outsource?

Hire in-house when AI is core to your product and the knowledge needs to compound internally, and you can actually vet the candidate. Outsource or embed when you need senior judgment now, cannot afford a three-month search, or cannot evaluate the skill yourself yet. If you want senior, pre-vetted AI engineers without the search, that is the work we do at Devlyn, with a working proof point inside a week and a replacement guarantee if the fit is wrong.

AI Engineer Skills: What Actually Separates the Good Ones

Alpesh Nakrani — Wed, 06 May 2026 18:30:00 GMT

The AI engineer skills that matter in 2026 are LLM and RAG work, eval design, prompt and context engineering, and solid software fundamentals. The one that separates the good hires is judgment.

The core AI engineer skills in 2026 are these: working fluency with LLM APIs and retrieval-augmented generation (RAG), the ability to design evals that predict production behavior, prompt and context engineering, real software engineering and Python, and enough MLOps and LLMOps to ship and operate a system that does not fall over at scale. That is the technical floor. Every competent candidate has some version of it. The skill that actually separates a good AI engineer from an expensive one is judgment: knowing when the model is wrong, when "good enough" is actually good enough, and which 5% of cases will hurt you in production.

I am an engineer who moved into a CRO seat at Devlyn, where part of my job is hiring and deploying AI engineers into real products that touch real customers. I have read a lot of resumes that list every keyword in this article and interviewed the people behind them. The resume tells you what someone has been near. It does not tell you what they can do. This piece is about the difference, written from the side of the table that pays for the mistakes.

If you are sizing up the whole hiring process and not just the skills, this article sits under my guide to hiring AI engineers. Read that for the full picture. Read this for what to actually look for.

Key takeaways

The technical floor is real but commoditized. LLM and RAG fluency, eval design, prompt and context engineering, Python, and LLMOps are table stakes. Most candidates clear the floor on paper.
Judgment is the separator. The engineer who knows when the model is wrong, and which failures actually cost you money, is worth far more than the one who ships fastest.
Eval skill is the most underrated AI engineer skill. An engineer who can build an honest eval suite is telling you they know how their system fails. That is rarer than RAG knowledge and far more valuable.
Resume keywords are not skills. "LangChain, RAG, vector DBs" on a resume signals exposure, not competence. The test is whether they can explain a trade-off they made and what it cost.
Soft skills are not soft. Ownership, system-level thinking, and the discipline to say "I do not trust this output yet" are what keep an AI feature from embarrassing you in front of a customer.

The core technical skills an AI engineer needs in 2026

Let me lay out the technical floor honestly, because you cannot evaluate judgment in someone who lacks the fundamentals. These are the AI engineer skills that should appear in some form in any serious candidate, and most of them now show up on every resume in the pile.

LLM and RAG fluency. The dominant production pattern is still retrieval-augmented generation: connect a model to an external knowledge base at inference time so answers stay grounded in real content. The easy part is wiring it up. The real skill is building a RAG system that survives messy data, ambiguous queries, and corpus drift over eighteen months. I have written about what that looks like once it hits reality in how agentic workflows behave in production.

Prompt and context engineering. Prompt engineering is how you talk to the model. Context engineering is what the model can see when it answers: memory, retrieved documents, tool definitions, conversation history, the whole information environment. In a 2026 industry survey, 82% of IT and data leaders said prompt engineering alone is no longer enough to run AI at scale (State of Context Management Report, 2026). The Model Context Protocol (MCP), now stewarded under the Linux Foundation, has become the common standard for how agents discover and call tools (Anthropic's guidance on context engineering covers the shift well). A strong engineer designs the context pipeline, not just the prompt string.

Software engineering and Python. Python and SQL remain the working languages, but the skill that gets undervalued is plain software engineering: writing APIs that hold up, handling errors and timeouts, designing for the system that gets ten thousand concurrent users rather than the demo that gets one. An AI feature is still software. Most AI features that fail in production fail for boring software reasons, not exotic model reasons.

MLOps, LLMOps, and observability. Shipping is half the job. The other half is operating: Docker and a cloud platform, deployment pipelines, cost controls, and the observability to know what your system is doing once it is live. LLMOps is MLOps with new failure modes, where a regression can be a model that got more confident and less correct rather than a service that went down.

Eval design. I am listing this last in the technical floor and giving it its own section next, because it is the one skill that crosses from "technical competence" into "judgment." An engineer who can design an eval is an engineer who has thought hard about how their system fails. That is the rarest and most valuable thing on this list.

If you want help hiring against this exact floor rather than guessing at it, Devlyn places senior AI application engineers who have already been vetted against it. That is the work we do every week.

The skill that actually separates good AI engineers: evaluation judgment

Here is my thesis, and it is the one I will defend hardest. The technical floor is necessary and roughly commoditized. The thing that separates a good AI engineer from a dangerous one is the ability to tell when the model is wrong and how much that wrongness costs. I call this evaluation judgment, and it is the skill the resume keywords cannot show you.

An LLM is fluent by default. It produces confident, well-formed output whether it is right or wrong. A junior engineer reads that fluency as correctness and ships it. A senior engineer reads it as a claim that needs checking, builds the machinery to check it, and knows which failures are tolerable and which will end up in a customer complaint. The gap between those two people is not measured in years. It is measured in whether they have been burned enough to stop trusting confident output.

This is why eval skill is the single best proxy for the broader judgment you are hiring for. An engineer who can build an honest eval suite has already done the hard thinking: what does "correct" mean for this task, which failure modes matter most, what is the cost of being confidently wrong. I have made the full case for measuring this in my complete guide to LLM evaluation, and the metric I care about most is whether the eval set is frozen and sampled from real traffic rather than a number someone can edit upward when it looks bad. A candidate who can explain a grounded metric like faithfulness, the share of answer claims actually supported by the retrieved context (RAGAS defines it precisely), is showing you they think in failure modes rather than vibes.

The technical floor is commoditized. The thing that separates a good AI engineer from a dangerous one is knowing when the model is wrong and how much that wrongness costs.

I once watched two engineers tackle the same task: classify inbound customer questions for an AI support flow. The first reported 94% accuracy after an afternoon and wanted to ship. The second spent three days building a frozen eval set from real tickets, came back with 88%, and then showed me that the 12% of misses clustered in the highest-value billing questions where a wrong answer meant a refund and an angry call. We shipped the second engineer's version. The lower number was the more honest one, and the more profitable one. (Numbers illustrative, the lesson is not.)

This is the same instinct I argue for in the judgment economy: when the machine writes the code and drafts the answer, the scarce human skill is evaluating the work, not producing it. An AI engineer who has internalized that is worth a multiple of one who has not.

Soft skills are not soft: ownership, communication, system-level thinking

The phrase "soft skills" makes hiring managers roll their eyes, so let me reframe it. In AI engineering these are not soft. They are the difference between a feature that works and a feature that quietly corrupts data for a month before anyone notices.

Ownership. The most dangerous pattern in 2026 is the engineer who generates output they do not fully understand and ships it anyway. One study found that 85% of junior developers felt AI tools improved their understanding, while only 16% of seniors believed juniors actually understood the AI-generated code they were submitting. That gap, between producing and owning, is exactly the gap you are hiring against. You want the person who treats the system's output as theirs to defend.

System-level thinking. A model can write any single component. What it cannot do is understand the consequences of that component three layers down: that a small prompt change will fracture a shared pattern, or that a new retrieval source will quietly shift the cost curve. That kind of taste only comes from working through enough real systems to feel where they drift. I dig into this in the case for shipping smaller models, which is really a piece about engineering judgment over benchmark chasing.

Communication and honesty about uncertainty. The best AI engineer I have hired was not the strongest coder in the room. He was the one who said "I do not trust this output yet, here is how I am going to find out if it is safe." An engineer who can tell a non-technical stakeholder why the model failed, in plain language, is worth more in a customer-facing product than one who cannot, regardless of raw skill.

AI engineer skills by seniority: junior, mid, senior

The same skill list reads completely differently depending on the level you are hiring for. Here is how the expectations shift, and what I actually probe for at each stage.

Junior. A junior AI engineer should clear the technical floor: they can build a RAG pipeline, call an LLM API, write clean Python, and use an orchestration framework. The thing I look for is not output speed. It is the failure response. When the model gives a wrong answer, does the junior notice something is off and dig, or do they accept it and move on? A strong junior catches the smell of a wrong answer. A weak one ships it. That instinct, more than any framework, predicts who becomes good.

Mid-level. A mid-level engineer owns a feature end to end: design, evals, deployment, and the on-call when it breaks. They should be able to make and defend a real trade-off, such as choosing a smaller model and routing the hard cases to a larger one, the pattern I describe in model routing. At this level I want to hear about a decision they got wrong and what it taught them.

Senior. A senior AI engineer is hired for judgment, not throughput, because the throughput is increasingly the machine's job. They set the eval bar, they decide what "safe to ship" means, they know when to keep a human in the loop and when that is just a way to avoid building a real system. I cover that specific failure mode in why "a human reviews it" is not a plan. A senior who cannot articulate their eval philosophy is not actually senior, no matter the title.

The AI engineer skills table: what to test and how

Here is the same skill set as a hiring tool. For each skill, what it is, why it matters, and a concrete way to test for it that resume keywords cannot fake.

Skill	Why it matters	How to test for it
RAG and retrieval	Most production LLM features are RAG; the hard part is surviving messy data and drift	Ask how they would handle a corpus that changes weekly and queries the docs never anticipated
Eval design	Best proxy for judgment; shows they understand how their system fails	Ask them to design an eval for a task on the spot; listen for "frozen set" and "failure modes by cost"
Prompt and context engineering	Quality and cost live in the context pipeline, not the prompt string	Ask what goes into the context window for an agent and why each piece earns its tokens
Software engineering	Most AI features fail for boring software reasons, not model reasons	Have them debug a flaky API call with timeouts and partial failures, no model involved
LLMOps and observability	Operating a live system is half the job; silent regressions are the real risk	Ask how they would know, in production, that the model got quietly worse last Tuesday
Judgment and ownership	The actual separator; knowing when output is wrong and what it costs	Ask about a time the model was confidently wrong and what they did; vague answers are a red flag

How to spot real skill versus resume keywords

The resume-optimization industry has trained candidates to stuff every keyword in this article into a skills section: LLMs, RAG, vector stores, orchestration, function calling, structured output, model routing, token optimization. An applicant-tracking system rewards that, so the keywords tell you almost nothing about whether the person can do the work. You have to test past them.

The single best test I know is also the simplest: ask the candidate to walk you through a system they built, the decisions they made, the trade-offs they weighed, and what they would do differently. A person who actually built it will get specific fast. They will tell you why they chose pgvector over a hosted option, what broke at scale, what the eval looked like, what the wrong answer cost. A person who pattern-matched their way through tutorials will stay generic, because generic is all they have.

Watch for three tells. First, can they name a trade-off and its cost, not just a technology? "We used RAG" is a keyword. "We used RAG and ate higher latency to keep answers auditable for compliance" is a skill. Second, how do they talk about failure? Real engineers talk about failure constantly because they live in it. Third, when you push on a decision, do they defend it with reasoning or retreat to "that is just best practice"? Best practice is what people cite when they do not understand the trade-off.

"We used RAG" is a keyword. "We used RAG and ate higher latency to keep answers auditable for compliance" is a skill. The cost of the trade-off is where the skill lives.

The market makes this harder, not easier. AI job postings sat roughly 134% above their 2020 baseline while overall postings grew about 6% (Indeed Hiring Lab, 2026), and one workforce report named AI the hardest skill in the world to hire for (ManpowerGroup, 2026). Demand that intense produces a lot of resumes that look identical and a wide gap in what is behind them. Median US AI engineer pay sits around $173,000 with senior offers well past that (Glassdoor, 2026), which means a bad hire is expensive twice: once in salary and once in the production mess they leave. Testing for real skill is not optional at those numbers.

If you would rather not run that gauntlet yourself, this is precisely the filtering we do at Devlyn before anyone reaches your team. We put a senior AI application engineer in front of you with a week-one proof of concept, so you evaluate the work and not the resume. You can see how that engagement works here.

Frequently asked questions

What skills does an AI engineer need in 2026?

The technical floor is LLM and RAG fluency, eval design, prompt and context engineering, real software engineering and Python, and enough MLOps and LLMOps to deploy and operate a system in production. Above that floor, the skill that actually separates strong engineers is judgment: knowing when model output is wrong and which failures carry real cost. The keywords are the entry ticket, not the qualification.

Is prompt engineering still a valuable AI engineer skill?

Yes, but it has been absorbed into the larger skill of context engineering, which is about designing the entire information environment the model sees, not just the prompt wording. In a 2026 survey, 82% of IT and data leaders said prompt engineering alone was no longer sufficient to run AI at scale. Treat prompt craft as one part of building a context pipeline, including memory, retrieval, and tool definitions.

What separates a senior AI engineer from a junior one?

Both can clear the technical floor; the difference is judgment under uncertainty. A junior often treats confident model output as correct and ships it, while a senior treats it as a claim to verify, builds evals to check it, and knows which failures are tolerable. The cleanest tell in an interview is what they do when the model is wrong: a strong engineer notices and digs, a weak one moves on.

How do you test for real AI engineering skill instead of resume keywords?

Ask the candidate to walk through a system they built, including the trade-offs they made and what each one cost. Real builders get specific about failures and the reasoning behind their decisions, while people who pattern-matched through tutorials stay generic. Probe a decision and see whether they defend it with reasoning or retreat to "best practice."

The skills that made a great engineer in 2018 are not the ones that matter when the machine writes the code. I wrote a short book on how to hire for the new ones, Building an AI-Native Team, which goes deeper on hiring for judgment over throughput. And if you want a vetted senior AI application engineer who already clears this bar, proving it with real work in week one, that is exactly what Devlyn's hiring engagement is built to deliver. Hire the judgment. The keywords are easy to find.

AI Engineer Interview Questions That Reveal the Real Ones

Alpesh Nakrani — Tue, 05 May 2026 18:30:00 GMT

The AI engineer interview questions that work test judgment, not trivia: RAG failure modes, eval design, and how a candidate handles being wrong.

The AI engineer interview questions that actually reveal a real engineer are not the ones you can look up. They are the ones that force a candidate to show judgment: how did you measure that it worked, what broke in production, why did you build it that way instead of the obvious way, and what did you do when the model was confidently wrong in front of a customer. I have sat on both sides of this table, the seat where I am being grilled and the seat where I am doing the grilling, and the gap between a candidate who can recite the transformer architecture and one who can ship a margin-positive AI feature is almost never visible in the trivia round.

I run hiring loops for AI engineers, vet them for clients, and then deploy them on real revenue-bearing work, so I get fast feedback on whether my interview was actually predictive. It usually was not, the first few times. I was asking the questions everyone asks, the ones in the top ten search results, and I kept hiring people who interviewed beautifully and then froze the first time an eval set disagreed with their intuition. This piece is the loop I run now, written from the hiring seat, but if you are the one being interviewed, read it as a map of what a good interviewer is actually trying to learn.

Knowledge is no longer the signal. Anyone can recite RAG architecture in 2026. You are buying judgment: how a candidate frames a problem, measures success, and handles being wrong.
The eval question is the fastest filter. "How did you measure that it worked?" separates engineers who shipped from engineers who demoed. No test set, no metric, no failure-mode discussion is a red flag at mid-level and above.
Make them debug, not define. "Your RAG support bot returns confident wrong answers, walk me through finding the cause" reveals more than any definition ever will.
Take-home as a pure signal is dying. 71% of engineering leaders say AI makes technical skills harder to assess, so the live round where you watch them work is now where the real signal lives.
Red-flag answers are louder than green ones. Notebook-only experience, fabricated model names, and "100% accuracy" tell you more in ten seconds than a polished resume tells you in an hour.

Why most AI engineer interview questions miss

The default AI engineer interview is a trivia quiz: explain attention, the difference between fine-tuning and RAG, what temperature does. These questions felt discriminating in 2022 because the knowledge was rare. In 2026 the knowledge is free, a candidate can absorb a clean explanation of any of it in an afternoon, and a meaningful fraction will quietly have a model feeding them answers during the call anyway.

So the trivia round mostly measures preparation, not capability. I have hired people who aced it and could not reason about why their retrieval was returning garbage, and passed on people who fumbled a definition and would have been my best hire of the year. As one hiring lead put it bluntly this year, knowledge is free in the age of ChatGPT, and what you are actually testing for is judgment.

That reframing changes every question you ask. You are not trying to confirm the candidate knows what a vector database is; you are trying to find out whether, handed a vague business problem and a budget, they will build the right thing, know whether it works, and tell you the truth when it does not. If you only take one idea from this piece into your next loop, take that one. For the wider hiring picture this sits inside, see my definitive 2026 guide to hiring AI engineers, which frames where these interview questions fit in the whole loop.

The technical questions that actually reveal an AI engineer

I still ask technical questions. I just ask them in a form that punishes recitation and rewards reasoning. The trick is to give the candidate a broken system or a real constraint, not a definition prompt, and then watch how they move.

"Your RAG support bot is returning confident, wrong answers. Walk me through how you find the cause." A weak answer jumps straight to "increase the context window" or "use a better model." A strong answer separates the failure modes first: is retrieval pulling the wrong chunks, or is retrieval fine and generation is ignoring the context. They will want to look at retrieved documents before touching the prompt, because they have lived this and know retrieval is the usual culprit. Designing a RAG system end-to-end is the most common opening question across companies in 2026, and the debugging variant is where the depth shows.

"How would you cut the inference cost of this feature in half without users noticing?" Weak answers say "use a cheaper model" and stop. Strong answers talk about routing cheap requests to a small model and escalating only the hard tail, about caching, about whether half the calls even need a model. They treat cost as a design dimension, not an afterthought, because in production it is the difference between a feature that ships and one that gets killed in a margin review.

"This prompt works in your testing and fails for 8% of real users. What do you do first?" The weak instinct is to tweak the prompt until the eight percent shrinks. The strong instinct is to go look at the eight percent, find the pattern, and decide whether it is a prompt problem, a retrieval problem, or a problem the model genuinely cannot solve and should hand to a human. That distinction is the whole job.

You are not trying to confirm the candidate knows what a vector database is. You are trying to find out whether, handed a vague problem and a budget, they will build the right thing and know whether it works.

The eval and judgment questions that separate seniors

If I could keep only one question, it would be this: "How did you measure that it worked?" It is disarmingly simple and it splits the room. Any answer that does not include a test set, a metric definition, and a discussion of failure modes is a red flag at mid-level and above. I have watched senior-titled candidates describe a year of work on a production AI system and have no answer to this beyond "users seemed happy." That is not a measurement, it is a hope.

The follow-up sharpens it: "How do you know it works when there is no single right answer?" Evaluation thinking, the ability to measure quality on open-ended output, is now treated as one of the core skills an AI engineer interview should test. A strong candidate reaches for a frozen, production-sampled eval set, a human-disagreement rate, faithfulness against the source, and the honesty that aggregate accuracy on a set you keep editing is a lie you tell yourself. If you want the full version of that conversation, my complete guide to LLM evaluation is the reference I send candidates and clients alike.

Then the question that finds the operators: "Tell me about something you decided not to build." Before strong candidates answer how to build a thing, they ask why it needs to be built, and problem framing is what separates an engineer from a code-typist. The senior engineer has a story about pushing back on a feature, scoping a problem down, or shipping a boring rules-based solution because the model was overkill. The junior, however bright, almost never does, because they have not yet been responsible for the cost of being wrong.

One client loop made this vivid. We had two finalists for a single role, both strong on paper, for a customer-facing recommendation flow where a wrong answer costs a sale. The first answered the measurement question with a crisp account of a frozen eval set and a 6.8% human-disagreement threshold she had defended to leadership. The second talked for three minutes about model architectures and never once mentioned how he knew any of it worked.

We hired the first. Six months in, her flow was holding its error budget and the team trusted its numbers. That is the difference the eval question predicts, and it is invisible in any round that only tests knowledge.

System design: design a RAG support bot

The system-design round for an AI engineer is not the same as the classic distributed-systems round, though it borrows from it. I give a deliberately underspecified prompt, "design a RAG-based customer support assistant for our product," and the first thing I am grading is whether they ask questions before they draw boxes.

What is the volume. What is the latency budget. What happens when the bot does not know, does it guess or escalate. How do we measure whether it is helping or quietly making support worse. A candidate who starts sketching an architecture diagram without asking any of that is showing me how they will behave on the job, which is to build before they understand the problem.

The strong design names its failure modes out loud: retrieval misses, stale documents, the bot answering confidently outside its knowledge. It includes an escalation path to a human and an observability layer so you can see cost and quality per resolved ticket, not just per call. It treats the small-model-first, escalate-rarely pattern as the default rather than reaching for the frontier model on every request. These are the same instincts I look for in the skills that actually separate good AI engineers, and they show up most clearly under the pressure of an open design prompt.

Behavioral and ownership questions

The behavioral round for AI engineers has one job: find out whether accountability is internalized or performed. AI work fails in public and fails often, a model says something wrong to a real customer, and the engineer who owns that moment is worth more than the one who is brilliant in a notebook.

"Tell me about a time your model was wrong in production. What happened and what did you do?" I am listening for whether deployment was the finish line or the starting line. Candidates who see shipping as the end have usually not shipped anything real, because everyone who has knows the model misbehaving in week three is the actual job. A strong answer has a monitoring story, a rollback story, and a what-I-changed-so-it-would-not-recur story.

"Walk me through a decision you got wrong." The point is not the mistake, it is whether they can name it without flinching and tell me what it cost. The performance of accountability sounds like "my biggest weakness is caring too much." Real accountability has a number attached, a week lost, a customer escalation, a cost overrun, and a clear account of the lesson. The honesty in that answer correlates almost perfectly with how someone behaves when the inevitable production incident lands on a Friday afternoon.

The red-flag answers

Some answers should end the conversation, or at least demote the candidate a level. These are the ones I have learned to listen for, and most of them surface in the first fifteen minutes.

Notebook-only. Comfortable in Jupyter, never shipped to production. They treat deployment as someone else's problem, which means they have never owned the part of the job that is actually hard.
No failure-mode discussion. They describe building an agent or a RAG system with zero account of how it broke. Either they did not look, or they did not ship it long enough to find out.
Fabricated specifics. Ask a candidate to name the current Claude generation and its pricing. A confident answer naming a model that does not exist is a tell that their information sources are stale and their confidence outruns their knowledge, which is exactly the failure mode you cannot afford in someone who decides what to ship.
"100% accuracy." Any claim of perfect accuracy, or "implemented RAG" with no recall metric, signals someone who does not measure. The number is impossible, and offering it confidently means they either do not know that or hope you do not.

None of these is automatically disqualifying on its own, people get nervous and misspeak, but two of them in one interview is a pattern. The strongest signal is the inverse: a candidate who volunteers what they got wrong, what they could not measure, and where the model still fails. Honesty about limits is the rarest and most valuable thing in this market.

The strongest signal is the inverse of a red flag: a candidate who volunteers what they got wrong, what they could not measure, and where the model still fails.

Take-home versus live, and the AI-assisted round

The honest take in 2026 is that the take-home assignment as a pure signal is decaying fast. 71% of engineering leaders now say AI is making technical skills harder to assess, per Karat's survey of 400 leaders across the US, India, and China. A take-home built to test coding-without-AI measures a skill the candidate will never use again, because the baseline expectation at every serious org is that engineers use AI tools daily.

That same survey shows the geography of the shift: 45% of US orgs still lean on take-home projects versus 20% in China, and Chinese companies are nearly twice as likely to allow AI use in live interviews. The direction of travel is clear. The signal is moving from "what did you produce alone overnight" to "show me how you work, with the tools you actually use, while I watch."

Field research into AI-engineering hiring through late 2025 and early 2026 found the typical process runs three to six rounds over two to six weeks, with take-homes spanning anywhere from a couple of hours to three days, usually building a small RAG or agent application. I still use a short take-home, but I treat it as a conversation starter, not a verdict. The next round is the candidate walking me through their own code, and I am grading the walkthrough, not the artifact.

The format that tells me the most is the AI-assisted round: hand the candidate an AI coding tool and watch what they prompt, what they accept, and what they catch. Strong candidates read AI-generated code and almost always test it before trusting it, while weak candidates accept the first plausible output and move on. That single behavior, do they verify or do they believe, is the most predictive thing I observe in an entire loop, because it is the exact judgment they will exercise a thousand times on the job. It is also the single hardest thing to screen for at volume, which is why I weight the live round so heavily.

AI engineer interview questions: a table you can run your loop from

Here is the loop in one place: the question, what it is actually testing, and how to tell a weak answer from a strong one. Run your interview from this and you will learn more in an hour than the trivia quiz teaches you in three.

Question	What it reveals	Weak answer	Strong answer
How did you measure that it worked?	Whether they shipped or demoed	"Users seemed happy"	Frozen test set, metric definition, failure modes by severity
Your RAG bot returns confident wrong answers. Find the cause.	Debugging instinct under ambiguity	"Use a bigger model"	Separates retrieval failure from generation failure; inspects chunks first
Cut this feature's inference cost in half.	Whether cost is a design dimension	"Switch to a cheaper model"	Routing, caching, asking which calls need a model at all
How do you measure quality with no single right answer?	Eval maturity	Vague "human review"	Human-disagreement rate, faithfulness, frozen production-sampled set
Tell me about a model that was wrong in production.	Ownership past the ship date	"We tested it well, so it didn't happen"	Monitoring, rollback, root cause, the fix that stopped recurrence
Name the current frontier model and its pricing.	Whether their sources are current	Confidently names a model that does not exist	Names a real current model and rough pricing, or admits uncertainty

The point of the table is not to read it out like a script. It is to keep you anchored on the column that matters: not what the candidate knows, but what their answer reveals about how they will behave when the work gets hard.

Frequently asked questions

What are the best AI engineer interview questions to ask? The ones that test judgment over recall: "how did you measure that it worked," "walk me through debugging a RAG bot that returns confident wrong answers," "cut this feature's cost in half," and "tell me about a model that was wrong in production." Each forces the candidate to reason from real experience instead of reciting a definition, which is the only thing that predicts on-the-job performance now that knowledge is freely available.

What is the biggest red flag in an AI engineer interview? No answer to "how did you measure accuracy." Any response that lacks a test set, a metric definition, and a discussion of failure modes is a red flag at mid-level and above. Close behind it are notebook-only experience with no production exposure, claims of "100% accuracy," and confidently naming a model or spec version that does not exist.

Should I use a take-home assignment or live coding? Lean live. 71% of engineering leaders say AI is making technical skills harder to assess, so a take-home built to test unaided coding measures a skill the candidate will never use on the job. Use a short take-home as a conversation starter if you like, then spend the real signal in a live round where you watch how they work with AI tools, what they prompt, what they accept, and what they catch.

How do I interview a senior AI engineer differently from a junior? Push harder on judgment and ownership. A senior should have stories about something they decided not to build, a model that failed in production and what they changed, and how they framed a vague problem before writing code. Juniors can have strong fundamentals but rarely have lived the cost of being wrong, which is exactly what senior questions are designed to surface.

If running this loop well sounds like more interviewing infrastructure than you want to build, that is the gap I built Devlyn's pre-vetted AI engineers to close, skip the gauntlet and get engineers already screened against these signals. For the full hiring system this loop fits inside, start with the definitive guide to hiring AI engineers, and for the deeper playbook on building the team around them, my book Building the AI-Native Team walks through it end to end.

AI Engineer Cost: What It Really Takes to Hire One

Alpesh Nakrani — Mon, 04 May 2026 18:30:00 GMT

AI engineer cost is far more than salary. Here are the real 2026 ranges, the loaded number nobody quotes you, and how to choose between in-house, staff aug, and an agency.

The honest answer to "what does an AI engineer cost" is a range, and you need to hear the whole range before you commit a budget. In the US in 2026, a competent AI engineer runs roughly $150,000 to $200,000 in base salary at mainstream employers, climbing past $300,000 base for senior specialists and into total-comp territory of $500,000 and up at frontier labs. Offshore and nearshore the same capacity costs $3,500 to $13,000 a month all-in. And a senior AI engineer through a transparent-rate agency lands somewhere in between, with no hiring risk attached.

But every one of those numbers is the salary line, and the salary line is the smallest part of the true cost. I have hired and deployed more than 80 senior AI engineers at Devlyn and I sit in two seats at once: I read the model traces and I read the P&L. From that seat, the salary is the part of the cost I worry about least.

The loaded overhead, the months of ramp before the person ships anything that holds up in production, and the catastrophic cost of getting the hire wrong are where the real money is. This piece is the cost deep-dive that branches off my pillar guide to hiring AI engineers, and I am going to give you the whole picture, not just the rate card.

Salary is the smallest line item. A US base of $170K becomes a fully-loaded cost of $215K to $240K once you add benefits, taxes, equipment, and software, before the person has shipped anything.
Ramp is a hidden cost center. A new senior AI engineer runs at 40-60% capacity for their first six months, and a senior teammate loses 20-40% of their time mentoring them through it.
Offshore is cheaper on the rate card, not always on the outcome. A $4,000/month engineer who needs heavy review can cost more in senior attention than a $5,500/month one who does not.
A bad AI hire is the dominant risk. It costs 30-50% of annual salary at minimum, and $150K to $300K for a senior when you price in the wasted ramp and the production damage.
Optimize for cost per shipped feature, not cost per hour. The cheapest hour and the cheapest outcome are rarely the same person.

AI engineer cost in the US, by seniority

Let me start with the salary line, because it is the number everyone anchors on, and then I will spend the rest of the piece explaining why it is the wrong number to anchor on. The figures here are real and sourced; where I give a range it is because the real data is a range, not because I am hedging.

The neutral government anchor is the Bureau of Labor Statistics, which puts the median wage for computer and information research scientists, the category that covers most AI and ML roles, at $140,910 as of May 2024. That is the floor of the conversation, not the market for a strong production AI engineer. For that, the Robert Half 2026 Salary Guide is more useful: it puts the AI/ML engineer midpoint at $170,750, up 4.4% year over year, the steepest jump of any tech specialty it tracks.

Break that out by seniority and the spread is wide. Entry-level AI engineers cluster around $100,000 to $145,000 base. Mid-level engineers doing genuine production work run $155,000 to $200,000. Senior specialists command $200,000 to $310,000 in base alone, per 2026 offer data from Levels.fyi and Robert Half.

Total compensation tells a bigger story than base. Once you add equity and bonus, a senior package regularly clears $300,000 at mainstream tech companies, and equity can make up a large share of it.

Then there is the ceiling, which I want to label clearly as an outlier so you do not budget against it. At frontier labs like OpenAI and Anthropic, Levels.fyi data from May 2026 shows median total compensation for engineers in the $600,000 to $795,000 range, with equity making up the majority of the package above mid-level. That is a market for a few thousand people on earth. It is not the market you are hiring in unless you are competing for the same handful of researchers, and if you are, you already know it.

The total cost of ownership is not the salary

Here is where most cost conversations go wrong. You negotiate a $170,000 base, you write it into the budget, and you think that is what the engineer costs. It is not close. The salary is the input to the cost, not the cost.

Start with the loaded number. A US employer pays roughly 25% to 40% on top of base for payroll taxes, health benefits, retirement contributions, equipment, software licenses, and the slice of facilities and overhead that the role consumes. That turns a $170,000 base into a fully-loaded cost of roughly $215,000 to $240,000 a year, before the engineer has shipped a single thing that survives contact with production.

Then add ramp, which is the cost nobody puts in the budget. A senior engineer joining a non-trivial codebase reaches genuine time-to-productivity somewhere between four and nine months in, and operates at 40% to 60% of full capacity for the first six. During that window the cost is doubled: you are paying the new hire's loaded salary while getting partial output, and a senior teammate is spending 20% to 40% of their time mentoring, reviewing, and unblocking, which is real output you are no longer getting from your most expensive person.

The salary is the input to the cost, not the cost. By the time a US AI engineer is productive, you have spent well past their first year's base.

So the honest first-year cost of a US senior AI hire is not the $200,000 you wrote down. It is the loaded $250,000-plus, minus the partial productivity of the first six months, plus the lost productivity of the senior doing the mentoring. Priced properly, the first year of a US in-house senior AI engineer is closer to $300,000 of real cost than to their base salary. That number is the one you should actually be comparing alternatives against.

Offshore and nearshore: what the rate card hides

The moment you see the US loaded number, offshore looks irresistible, and the rate cards are real. A senior AI engineer in Latin America runs roughly $8,000 to $13,000 a month all-in, a 30% to 45% saving versus the US with five to eight hours of time-zone overlap. Offshore in Asia, the same seniority lands at $3,500 to $6,500 a month, a 65% to 75% saving with four to six hours of overlap. Hourly, a US senior runs $150 to $250, a Latin American senior $50 to $90, an Indian senior $25 to $45, with an AI/ML specialization premium of 20% to 40% layered on top.

Those numbers are accurate. What they hide is that the rate card prices the hour, not the outcome. The variable that actually moves your cost is review load: how much of your senior staff's time the engineer's output consumes before it can ship. An engineer who needs a thorough re-review of everything they produce is expensive at any hourly rate, because the scarce, costly resource in an AI team is not the engineer who writes the code, it is the senior who can look at model output and know whether it is correct.

I have made this mistake and paid for it. We brought on an offshore engineer at a rate that looked like a third of the US-equivalent. The code compiled and the demos worked. But every prompt change shipped a subtle regression we only caught in production, because the engineer could not tell a plausible model output from a correct one.

We were spending more senior review time cleaning up the cheap engineer than we would have spent supervising a more expensive one. The blended cost was higher than the rate card promised, and the customer-facing errors had a cost the rate card never showed at all. This is the same dynamic I describe in the judgment economy: when generation is cheap, the value concentrates in the judgment, and judgment is what you are actually buying.

In-house vs staff augmentation vs agency

Strip away the geography and there are three ways to acquire AI engineering capacity, and they fit different situations. This is the build-versus-buy decision, and the right answer depends on how durable your need is and how much hiring risk you can absorb.

In-house full-time is right when the work is core, permanent, and you have the senior bench to onboard and evaluate a hire. You pay the full loaded cost and the full ramp, but you build durable institutional knowledge and the person is fully yours. It is the most expensive option in the first year and often the cheapest in year three, if the hire was good.

Staff augmentation drops a contractor into your team on a monthly rate, with no benefits load and no severance risk, but also no durable ownership. It is right when you need capacity now, the need has a horizon, and you have the in-house seniority to direct and review the work. It trades the ramp-and-retention risk for a higher effective hourly rate.

An agency or product squad gives you a vetted senior engineer, or a small team, who owns delivery end to end, with the review and quality function built in rather than something you have to staff yourself. It is right when you do not have the senior bench to evaluate AI work in-house, or when you need a feature shipped on a timeline and cannot afford a six-month ramp. If you are still deciding whether to build the capability internally at all, an AI strategy and readiness assessment is the cheaper first move than hiring against a plan you have not pressure-tested.

The cost table you can paste into a board deck

Here is the whole decision in one place. The numbers are 2026 US-market figures for a senior AI engineer; adjust for seniority and region, but the relative shape holds.

Option	Typical cost	The tradeoff
US in-house, full-time (senior)	$200K-310K base; ~$300K+ true first-year cost	Highest cost and risk year one; durable ownership; full ramp on you
US staff augmentation (senior contractor)	$15K-22K / month all-in	Fast, no benefits load; higher effective rate; no durable knowledge
Nearshore (LATAM, senior)	$8K-13K / month all-in	30-45% saving, good overlap; review load and vetting are on you
Offshore (Asia, senior)	$3.5K-6.5K / month all-in	65-75% saving; lowest rate, highest review burden if mis-vetted
Transparent-rate agency (senior, vetted)	~$5.5K / month (FTE) to ~$13.5K / month (squad)	Senior-only, review built in; less control than direct hire

The point of laying it out this way is that the cheapest row is almost never the cheapest outcome. The offshore row has the lowest rate and the highest variance: get the vetting right and it is genuinely cheap, get it wrong and the review load erases the saving. The agency row, framed around transparent monthly rates with senior-only engineers, exists precisely to take the vetting and review variance off your plate, which is why the published Devlyn rate for a senior is $5,500 a month and the engineers are 5-to-10-plus-year people, not juniors hidden behind a markup.

The AI engineer cost nobody budgets: a bad hire

Everything above assumes the hire works out. The dominant cost in AI hiring is the case where it does not, and it is larger than most operators let themselves believe. The baseline industry estimate for a bad hire is 30% to 50% of the person's annual salary. For a senior software engineer, once you price in the four-to-nine-month ramp you funded for nothing, the senior time spent mentoring a person who will not work out, and the damage to whatever shipped, the all-in cost of a bad senior hire runs $150,000 to $300,000.

In AI specifically it is worse, because the failure mode is silent. A bad AI engineer does not write code that fails to compile. They write code that compiles, demos cleanly, and ships a model behavior that is confidently wrong in a way nobody catches until a customer hits it.

The cost is not just the wasted salary. It is the customer-facing error, the trust you spend rebuilding, and the senior who now has to forensically unwind work they did not write.

This is the entire case for hiring senior-only, which is the posture I run at Devlyn and explain in my broader argument for cost discipline in AI. The gap between a plausible wrong answer and a correct one is invisible without deep expertise. Hiring someone who cannot see that gap does not reduce your risk, it buries it, and you pay for it later at a much worse exchange rate. The premium for senior judgment is small next to the cost of the bad hire it prevents.

How to think about ROI, not rate

The right unit of cost is not dollars per hour or even dollars per month. It is dollars per shipped, working feature that holds up in production. That reframe changes every decision above.

An engineer at $250 an hour who ships a correct feature in two weeks with light review is cheaper, per outcome, than an engineer at $40 an hour who takes six weeks and consumes forty hours of senior review to get the same feature to the same quality bar. The hourly rate said one was six times cheaper. The cost per shipped feature said the opposite. I have watched both run side by side, and the rate card lied every time.

This is also why the hiring cost and the running cost of the model itself are two different budgets that people conflate. The engineer is what you pay to get a working system built and evaluated. The inference is what you pay to run it at volume.

A good engineer lowers your inference cost by designing a system that routes to smaller models and escalates rarely; a bad one leaves you paying frontier prices on every call forever. The hiring decision compounds into the running cost, which is one more reason to optimize for judgment over rate.

So when you build the budget, price the whole picture: the loaded salary, the ramp, the senior review load, the probability-weighted cost of a bad hire, and the inference architecture the engineer's judgment will shape. Then compare options against that number, not against the rate card. The frameworks for evaluating who actually clears that bar are in Building an AI-Native Team, which is the hiring playbook I wrote from this exact seat.

Frequently asked questions

How much does it cost to hire an AI engineer in 2026?

In the US, expect a base salary of roughly $150,000 to $200,000 for a mid-level engineer and $200,000 to $310,000 for a senior, per Robert Half and Levels.fyi 2026 data. But the fully-loaded first-year cost of a senior in-house hire is closer to $300,000 once you add the 25% to 40% overhead, the partial productivity of a four-to-nine-month ramp, and the senior time spent mentoring. Offshore and nearshore the same seniority runs $3,500 to $13,000 a month all-in.

Is it cheaper to hire an AI engineer offshore?

On the rate card, yes: offshore Asia runs 65% to 75% below US rates and nearshore LATAM 30% to 45% below. On the outcome, not always. The cost that actually moves is review load, how much senior time the engineer's output consumes before it can ship. A poorly vetted cheap engineer can cost more in senior attention than a more expensive, well-vetted one, so vet for judgment, not just rate.

What is the real cost of a bad AI hire?

The baseline is 30% to 50% of annual salary, but for a senior AI engineer the all-in cost runs $150,000 to $300,000 once you include the wasted ramp, the senior mentoring time, and the production damage. In AI the failure is silent, code that compiles and demos cleanly but ships confidently wrong model behavior, which is the core argument for hiring senior-only and evaluating for judgment.

In-house, staff augmentation, or agency, which is cheapest?

It depends on how durable the need is and how much hiring risk you can absorb. In-house is most expensive in year one and cheapest by year three if the hire is good. Staff augmentation trades a higher effective rate for zero ramp-and-retention risk. A transparent-rate agency with vetted senior engineers takes the vetting and review variance off your plate, which is why Devlyn publishes senior monthly rates rather than billing you for the markup on a junior.

AI Engineer Job Description: What to Put In It

Alpesh Nakrani — Sun, 03 May 2026 18:30:00 GMT

A good AI engineer job description names the production problem, separates required from nice-to-have, and avoids the keyword pile that repels your best builders.

A good AI engineer job description does three things: it names the production problem you are hiring someone to own, it separates the handful of skills you actually require from the long list that would be nice to have, and it states the seniority and pay honestly. Everything else is decoration. If your JD reads like a pile of frameworks and acronyms, you have written a filter that screens out the builders you want and lets through the people who are good at matching keywords. A usable copy-pasteable template is further down this page; first, the parts that decide whether it works.

I am an engineer who moved into a CRO seat, and part of my job is writing the reqs and then living with whoever they attract. I have hired and deployed senior AI engineers into products that touch real customers, and I have also written a few job descriptions early on that I am not proud of. The bad ones all failed the same way: they described a person, not a job, and listed twenty tools with zero outcomes. They asked for a researcher and a full-stack engineer and an MLOps owner in one hire, at a salary pegged to last year, and the good candidates read that and kept scrolling.

This piece is the practical version, written from the side of the table that pays for the mis-hire. It sits under my guide to hiring AI engineers; read that for the whole process. Read this for the document itself: what the role owns, the responsibilities that matter, required versus nice-to-have skills, how the JD shifts by seniority and sub-role, the mistakes that quietly kill your pipeline, and a template you can paste and edit today.

Write the production problem first, the JD second. If you cannot name what breaks in production when this hire fails, your JD will become a keyword pile.
Required versus nice-to-have is the whole game. Twenty "required" tools signals you do not know what you need, and it repels the people who do.
"AI engineer" is four different jobs. Application engineer, ML engineer, platform engineer, and researcher need different reqs; one JD for all of them gets you none of them well.
Seniority changes the verbs, not the keywords. Junior owns execution, senior owns design and the decision; a JD that only lists tools cannot tell the difference.
Stale pay is a silent disqualifier. If your band is a year behind the market, the best people self-select out before you ever see them.

Write what the role actually owns before you write the JD

Before you type a single bullet, answer one question: what production problem does this person own? Not what tools they touch. What breaks, in front of a customer or a P&L, when this role is empty or filled badly. That answer is the spine of the whole document.

An AI engineer in 2026 is, in practice, a production-oriented engineer who builds, evaluates, and operates systems on top of foundation models. Analysis of more than a thousand live postings found employers prioritize RAG, LLM integration, Python, cloud infrastructure, and deployment far above fine-tuning or research depth (AI Shipping Labs). The job is shipping and keeping things alive, not publishing.

So write the ownership statement first. Something like: "You own our document-extraction feature end to end: retrieval quality, latency, the eval suite that gates releases, and what happens when the model is wrong in front of a customer." That one sentence does more filtering than a list of fifty technologies, because it tells a strong candidate exactly what they would be responsible for and lets them decide if it excites them.

If you cannot write that sentence, you are not ready to hire yet, and no template will save you. The teams I see struggle most are the ones who open a req because "we need AI" and expect the candidate to figure out the scope. That is not a job description. That is an admission that you have not done the work. And if you would rather not do that work at all, you can skip writing the JD and the three-month search and hire a pre-vetted senior AI engineer who has already shipped this exact shape of feature.

AI engineer roles and responsibilities: the section that does the work

The roles and responsibilities section is where most JDs go soft. They write verbs like "leverage," "utilize," and "collaborate" that describe nobody and commit to nothing. Strong candidates read vague responsibilities as a vague team. Write the ones that show real ownership instead.

A production AI engineer's responsibilities, written honestly, look like this: build and integrate LLM and retrieval features into the product, including the unglamorous parts, auth, permissions, streaming states, error handling. Design and maintain the eval suite that decides whether a change ships. Own latency and cost per request as product metrics, not afterthoughts. Instrument monitoring so you find out the model is failing before a customer does. Iterate after launch, because the first version is never the last.

A job description is a filter you point at yourself. Most teams point it the wrong way and then wonder why only the wrong people apply.

Notice what is missing: "train models from scratch," "publish research," "achieve state-of-the-art." Unless you are a frontier lab, those belong in a different JD entirely. Putting them in an application-engineering req is how you attract people who want to do research and resent the integration work that is actually 90% of the job. The mismatch surfaces in month two, and it is expensive.

One pattern I keep relearning: the responsibilities that look boring on paper are the ones that separate a good hire from an expensive one. "Owns the eval suite" and "owns what happens when the model is wrong" sound dull next to "builds cutting-edge AI." They are also the entire job. Write them anyway. The right candidate reads "owns what happens when the model is wrong" and thinks, finally, a team that gets it. For the deeper version of which abilities actually matter here, see the skills that actually separate the good ones.

Required versus nice-to-have skills, and why the line matters

This is the section that decides who applies. Get the required-versus-nice-to-have split right and you attract builders; get it wrong and you attract resume-matchers. The single most common mistake is listing twenty tools as "required," which signals to a strong engineer that you do not actually know what you need.

The genuine required floor for a production AI engineer is short. Working fluency with LLM APIs and RAG. Enough eval discipline to know when a model is wrong and when "good enough" is actually good enough. Real software engineering, usually Python. Enough MLOps or LLMOps to ship and operate without the system falling over. That is roughly it. Everything else is nice-to-have, and you should label it that way.

Here is the mental model I use. A required skill is one where, if the candidate lacks it, they cannot do the job on day thirty. A nice-to-have is one they can learn on the job or that only matters for a fraction of the work. Specific vector databases, a particular orchestration library, a named cloud provider, those are almost always nice-to-have, because a strong engineer learns a new vector DB in a week. Judgment under production pressure, they cannot acquire in a week, so that is the bar.

I once reviewed a req a founder had written that listed eleven "required" frameworks and libraries. I asked him which three he would actually fire someone for not knowing. He could only name one. The other ten were aspirational, copied from other postings, and every one of them was shrinking his applicant pool for no reason. We cut the list to four real requirements and moved the rest to "bonus." Applications from senior people went up, because senior people read a short, confident requirements list as a sign of a team that knows its own stack.

How the JD changes by seniority and sub-role

One JD cannot serve every level or every flavor of the role, and trying to make it do so is why so many reqs read as a catch-all. Two axes matter: seniority and sub-role. Get explicit about both.

On seniority, the verbs change, not just the years. A junior AI engineer owns execution: building against APIs, integrating models, writing tests, shipping under guidance. A senior owns design and the decision: choosing the approach, defining what "good enough" means, deciding what ships. A staff or lead owns the strategy and the other engineers. A JD that only lists tools cannot express this difference, which is why level-less reqs attract a confusing mix and convert almost nobody. Note too that the entry-level market is thin; only about 2.5% of AI engineering postings target candidates with under two years of experience, while two-to-six years is the common band (365 Data Science). If you write a "junior" req, expect a small and inexperienced pool, and price accordingly.

On sub-role, "AI engineer" is a catch-all for at least four different jobs. An AI application engineer integrates models into product features and owns the user-facing surface, an ML engineer owns models, training, and pipelines, a platform or infra engineer owns the serving and tooling underneath, and a researcher owns novel methods. These need genuinely different JDs. The most expensive hiring mistake I see is a manager opening one "AI engineer" req and silently expecting one person to cover all four, which is exactly the trap the market warns about (KORE1). Pick the sub-role, name it in the title, and write the req for that one job.

The AI engineer job description mistakes that repel the good ones

Most job descriptions do not fail by being too strict. They fail by being incoherent in ways that strong candidates read instantly. Here are the four that do the most damage.

The keyword pile. Twenty tools, no outcomes. A strong builder reads this as a team that hires by buzzword and will manage by buzzword too. The fix is the ownership statement from earlier: name the problem, list four real requirements, and let the tools be implementation details.

The catch-all title. "AI Engineer" with no sub-role, expecting researcher plus full-stack plus on-call in one hire. Candidates cannot tell what the job is, so the strong ones, who have options, pass. The fix is to name the sub-role in the title.

Stale compensation. This one is quiet and lethal. Senior AI engineer pay has re-priced hard; market guidance now treats a base near $200K as a floor for senior roles, with total comp frequently climbing past $300K (KORE1 salary guide). If your band was set last year, the best people see the number, do the math, and never apply. You will conclude "there is no talent" when the truth is your offer is a year behind.

Research cosplay. Asking for a PhD and publications for what is, in reality, an integration and operations job. It does not raise your bar; it just attracts people who will be unhappy doing the actual work. Match the requirements to the job you have, not the job that sounds impressive.

A copy-pasteable AI engineer job description template

Here is a template you can paste and edit. It is written for an AI application engineer, the most common production hire; swap the ownership line and requirements if you are hiring a different sub-role. Keep it short, keep the required list to four or five real items, and fill in a current pay band.

TITLE: Senior AI Application Engineer WHAT YOU'LL OWN You own [the feature, e.g. our document-extraction flow] end to end: retrieval quality, latency, the eval suite that gates releases, and what happens when the model is wrong in front of a customer. RESPONSIBILITIES - Build and integrate LLM and RAG features into the product, including auth, permissions, streaming states, and error handling. - Design and maintain the eval suite that decides what ships. - Own latency and cost-per-request as product metrics. - Instrument monitoring; catch model failures before customers do. - Iterate after launch based on real usage. REQUIRED (the day-30 floor) - Working fluency with LLM APIs and RAG. - Eval discipline: knows when the model is wrong and when good enough is. - Strong software engineering, Python. - Enough MLOps/LLMOps to ship and operate at scale. NICE TO HAVE (learnable, do not gate on these) - A specific vector DB, orchestration library, or cloud provider. - Fine-tuning experience; domain experience in [your industry]. SENIORITY & PAY Senior: owns design and the call on what ships. Base [current band], total comp [current band]. [Location / remote policy.]

That is the whole thing. Notice it commits to outcomes, keeps the required list honest, separates nice-to-have explicitly, names the sub-role in the title, and forces you to put a real number in. If you fill it out and the pay band makes you wince, that is the market talking, and it is better to hear it now than after a three-month empty search.

One table: what to put in each section, and what to avoid

The same advice, section by section, in one place you can scan while you write.

JD section	What to include	Mistake to avoid
Title	The specific sub-role (application, ML, platform, research)	Bare "AI Engineer" catch-all that means four jobs
Ownership line	The production problem this hire owns end to end	A person description with no stated outcome
Responsibilities	Concrete verbs: build, evaluate, own latency, monitor, iterate	"Leverage," "utilize," "collaborate" that commit to nothing
Required skills	The 4–5 day-30 must-haves, no more	Twenty "required" tools that signal you don't know what you need
Nice-to-have	Learnable tools, specific DBs/libraries, domain bonus	Hiding learnable skills inside "required"
Seniority	The verbs for the level: execution vs design vs strategy	Level-less req that converts nobody
Compensation	A current, market-rate band	A band pegged to last year's data

If you want the full hiring picture around this document, the funnel, the vetting, the cost in-house versus outsourced, that is what my guide to hiring AI engineers covers, and the playbook for building the team around the hire is in Building an AI-Native Team.

Frequently asked questions

What should an AI engineer job description include?

An ownership line naming the production problem the hire owns, concrete responsibilities written as real verbs, a short required-skills list of four to five day-30 must-haves, an explicit nice-to-have list, the seniority level expressed as the verbs for that level, and a current market-rate pay band. The template above gives you all of these in a structure you can paste and edit.

What are an AI engineer's main roles and responsibilities?

For the common production role: build and integrate LLM and RAG features including the unglamorous auth, permissions, and error-handling work; design and maintain the eval suite that gates releases; own latency and cost per request as product metrics; instrument monitoring to catch failures before customers do; and iterate after launch. Training models from scratch and publishing research belong in a researcher's JD, not an application engineer's.

What skills should be required versus nice-to-have?

Require only what the candidate must have by day thirty: working fluency with LLM APIs and RAG, eval discipline, strong software engineering and Python, and enough MLOps to ship and operate. Mark specific vector databases, orchestration libraries, cloud providers, and fine-tuning as nice-to-have, because a strong engineer learns those in a week. Judgment under production pressure is the real bar, and it cannot be learned in a week.

Why does my AI engineer job posting not attract good candidates?

Usually one of four reasons: the JD is a keyword pile with no stated outcome, the title is a catch-all that means four different jobs, the compensation band is pegged to last year and the best people quietly self-select out, or it asks for research credentials for what is really an integration job. Fix the ownership line, name the sub-role, refresh the pay band, and match the requirements to the actual work.

If writing the JD, posting it, and running a three-month search is not how you want to spend the next quarter, you can skip the search and hire a pre-vetted senior AI engineer who has already shipped production features of this exact shape. Either way, the principle holds: name the problem, keep the requirements honest, and the right people will recognize the team that gets it.

How to Vet AI Engineers: The Process That Predicts

Alpesh Nakrani — Sat, 02 May 2026 18:30:00 GMT

How to vet AI engineers in a way that predicts on-the-job performance: the work-sample that mirrors real work, the judgment probe, references, and a paid trial.

How to vet AI engineers, in the order that actually predicts whether they will perform on the job: read their portfolio for evidence they shipped and measured something real, run one work-sample that mirrors the actual job instead of a coding test, probe how they knew their system worked and what they did when it was wrong, check references for patterns rather than praise, and end on a short paid trial that puts them next to real work. The resume comes last in weight, not first. Everything that feels efficient to over-weight, the credential, the leetcode round, the model trivia, is the part that has stopped predicting anything.

I vet AI engineers for a living. I run the loop, I hire them, I deploy them on revenue-bearing work, and then I find out within a few weeks whether my vet was any good. That feedback loop is brutal and it is honest, and it has rewritten how I screen. The first versions of my process selected for people who interviewed beautifully and then froze the first time an eval set disagreed with their gut. This piece is the process I run now, written from the hiring seat. If you are reading from the other chair, treat it as a map of what a careful evaluator is actually trying to learn about you.

Resume signal is the weakest signal. Credentials, framework lists, and model trivia are all cheap to fake and easy to acquire in 2026. They are a filter for the obvious no, not a predictor of the yes.
The work-sample must test judgment, not typing. With around 84% of developers now using AI coding tools, raw code output no longer reflects raw capability. Give them a real eval or debug task, not a blank-file coding round.
"How did you know it worked?" is the fastest predictive filter. An engineer who cannot describe what they measured, what broke, and where a human sat in the loop has not shipped production AI, regardless of how the demo looks.
Structured beats unstructured by roughly 2x. Same task, same rubric, scored on a scorecard. Gut-feel interviews are about half as predictive of on-the-job performance.
The paid trial is the only test that fully predicts. A few days of real-ish work next to your team tells you more than every interview round combined.

Start with the signal you can trust: portfolio and GitHub, read correctly

Before any interview, I read what the candidate has actually built, and I read it the way a skeptic reads a P&L. The point of the portfolio pass is not to be impressed. It is to find the one project where they owned an AI system in production and to see whether they talk about it like someone who lived with its failures or someone who wrote it up for a resume.

What I read for: evidence of an eval set, even a crude one. A README that admits what the system gets wrong. A commit history that shows them fixing a real failure mode rather than chasing a green demo. A writeup that names a metric and a number, not just "improved accuracy." Those are the fingerprints of someone who has shipped and been held accountable for the output.

What I ignore: GitHub star counts, the length of the framework list, and the name of the vector database. None of those predict whether the person can tell a correct model output from a confidently wrong one. A repo with three hundred stars and no eval harness is a marketing artifact. A quiet repo with a frozen test set and an honest failure log is a hiring signal. To vet AI engineers well, you have to learn to read past the polish to the part that shows judgment, which is the same thing the hiring AI engineers pillar argues you are really buying.

A repo with three hundred stars and no eval harness is a marketing artifact. A quiet repo with a frozen test set and an honest failure log is a hiring signal.

The work-sample that mirrors the actual job, not a coding test

The single highest-validity thing you can do is give the candidate a task that looks like the work, then watch how they approach it. Work-sample tests have been near the top of the predictive-validity tables for decades, and they got more important the moment coding stopped being the bottleneck. The catch in 2026 is that the classic take-home no longer measures what it used to.

Here is the problem. The old take-home assumed the code a candidate produces reflects the candidate's capability. That assumption is dead. Around 84% of developers now use or plan to use AI tools in their workflow, per the most recent Stack Overflow Developer Survey, so a blank-file coding round mostly measures how well someone prompts a model you would have given them anyway. You are not learning what you think you are learning.

So I changed what the work-sample tests. Instead of "write this function," I give a task that is judgment-shaped and AI-resistant: here is a small RAG support bot that returns confident wrong answers about a third of the time, find out why and tell me what you would change. Or: here is an eval set and two model outputs, score them, defend your rubric, and tell me what the set is missing. The candidate can use any tool they want, because the thing I am scoring is not the typing. It is whether they reach for a measurement, whether they form a hypothesis before they touch code, and whether they can tell me what they are uncertain about.

One illustrative loop from my own files. A candidate with a thin resume and a state-school degree took the broken-RAG task, spent the first ten minutes building a tiny eval harness before changing anything, found the chunking bug, and then said the sentence that got him hired: "I would not ship this until I had thirty more labeled failures, this fix is a guess on a sample of one." He was right, and he was the strongest hire of that cohort. The credentialed candidate who skipped straight to a prompt tweak and declared it fixed was the one I would have over-weighted on paper.

Keep the work-sample short and paid if it runs past an hour. Score it on a written rubric you wrote before the session, because a structured work-sample scored the same way for every candidate is roughly twice as predictive as the same hour run on instinct. Schmidt and Hunter's long-standing meta-analysis put structured interviews at around 0.51 validity against 0.38 for unstructured, and more recent work from Sackett and colleagues revised the gap even wider, to roughly 0.42 versus 0.19. The discipline of sameness is not bureaucracy. It is the part that makes the comparison mean anything.

The fastest way to assess AI engineers: how they knew it worked

If I could keep only one question, it would be this: how did you know it worked? Then I follow the thread wherever it goes. What did you measure? On what set? What did the set miss? When the model was confidently wrong in front of a user, what did you do in the next hour, and what did you change so it would not happen again?

This is the question that separates engineers who shipped from engineers who demoed. Anyone can describe a RAG architecture in 2026; the docs are everywhere and the model will recite them for you. What cannot be faked is the texture of having owned a system that failed in front of real people. The engineer who has lived it answers with a metric, a failure mode, and a regret. The engineer who has not answers with an architecture diagram. The eval discipline I am probing for is the same muscle the whole field of LLM evaluation is built on, and it is the clearest dividing line between the two kinds of candidate.

I also probe for how they handle being wrong inside the interview itself. I will push back on a correct answer to see whether they fold or hold their ground with evidence, and I will hand them a genuinely ambiguous call to see whether they say "I do not know yet, here is how I would find out" or bluff. The bluff is disqualifying for production AI work, because a model that is confidently wrong is the whole job, and an engineer who is confidently wrong about the model is a force multiplier for the failure. The questions that do this well are the ones I collected in the piece on AI engineer interview questions; the short version is that you are testing for calibrated honesty, not recall.

Reference checks done right, including the backchannel

Most reference checks are theater. You call the three names the candidate handed you, you ask whether they would work with the person again, the reference says yes, and you have learned nothing. Done right, references are one of the highest-value parts of the process, and the trick is to ask for patterns instead of praise.

I ask references for the texture a resume hides. Tell me about a time the candidate was wrong and how they handled it. What would their last team say they need to work on. When something broke in production, were they the one who reached for the logs or the one who reached for an excuse. Then I shut up and let the pauses do the work, because the hesitation before a polite answer often says more than the answer.

The backchannel is where the real signal lives, and it has to be handled ethically. A backchannel reference is someone who worked with the candidate but is not on their list, reached through your own network. The hard rule is that you never contact anyone at the candidate's current employer, because exposing that someone is interviewing can cost them their job. Done within a trusted network and weighed only on patterns rather than one bitter anecdote, a backchannel will confirm your excitement or surface the blind spot the curated references were chosen to hide.

Most reference checks are theater. Done right, they are where you confirm your excitement or finally see the blind spot the curated names were chosen to hide.

The paid trial: the only test that fully predicts

Every method above is a proxy. The paid trial is the real thing. A few days to two weeks of real-ish work, paid at a fair rate, next to your team, tells you what no interview can: how they communicate mid-task, how they handle a vague spec, whether their first instinct under real pressure is to measure or to guess, and whether the people around them want more of it.

I scope the trial around an actual ticket or a sanitized version of one, never a contrived puzzle. I watch three things specifically: do they ask the clarifying question before they build, do they leave the codebase more measurable than they found it, and do they tell me when they are stuck instead of going dark for two days. A second illustrative case from my files: a candidate who aced every interview round went quiet for three days during the trial, then surfaced a half-finished branch with no tests and no questions asked. The interviews said yes. The trial said no, and the trial was right. We had spent maybe a few hundred dollars to avoid a hire that would have cost months.

The honest trade-off is that strong candidates have options and will not all do a trial, and a genuinely mission-critical full-time seat sometimes warrants going straight to an offer on the strength of the work-sample and references. But for contractors, for fractional work, and for any hire where you can afford the time, the paid trial is the cheapest insurance you will ever buy against a six-figure mistake. If you would rather not run this gauntlet yourself, the engineers we place through Devlyn's AI application engineer hiring have already cleared a version of it on live production work, so the trial is effectively done before you meet them.

What NOT to over-weight

The mirror image of a good vet is knowing what to discount, because most broken hiring processes are not missing a step. They are over-weighting the wrong ones. Here is what I have learned to discount, and why each one feels predictive but is not.

Credentials and pedigree. A degree or a brand-name former employer tells you the person cleared someone else's bar years ago, not that they can tell a correct model output from a confidently wrong one today. I have hired state-school self-taught engineers who ran circles around PhDs on production judgment, and the reverse, and neither credential predicted the outcome.

Leetcode and algorithm puzzles. They measure a narrow, coachable skill that has almost no overlap with the actual job of shipping reliable AI features. Model trivia is the same trap one layer up: reciting attention mechanisms or naming the newest model is recall, and recall is the cheapest thing on the market now. The AI engineer skills that actually matter are judgment skills, and none of them show up on a whiteboard.

Demo polish and raw take-home output. A beautiful demo selects for presentation, and a clean take-home now selects for prompt quality, not capability. Both look like competence and predict almost nothing about how the person behaves when an eval set disagrees with them at 11pm before a launch.

A scorecard to vet AI engineers you can run

Here is the whole process as one scorecard: each signal, the weight I give it, and how I actually test it. The weights are mine and they are deliberately tilted toward the things that have predicted on-the-job performance in my own hiring, not toward the things that are easy to measure.

Signal	Weight	How to test it
Eval and judgment ("how did you know it worked")	High	Live probe: metric, set, failure mode, what they changed
Work-sample on a real, AI-resistant task	High	Debug a broken RAG bot or score an eval set; rubric-scored
Paid trial on real-ish work	High	3-10 days, real ticket, watch communication and measurement
References, read for patterns	Medium	Pattern questions plus an ethical backchannel
Portfolio / GitHub depth	Medium	Read for eval sets and honest READMEs, not stars
Handles being wrong (calibration)	Medium	Push back in-interview; reward "I do not know yet, here is how I would find out"
Credentials and pedigree	Low	Filter for the obvious no only; never a tiebreaker
Leetcode / model trivia	Very low	Skip; it measures recall, not production judgment

Run the high-weight rows on every candidate in the same order with the same rubric, and you have a structured process that is roughly twice as predictive as the interview most teams actually run. The discipline is dull and the payoff is enormous, which is the usual shape of things that work. The full hiring loop this scorecard sits inside, including what good costs and how the bad hires fail, lives in the hiring AI engineers guide, and the deeper org playbook is in my book The AI-Native Team.

Frequently asked questions

How do you vet an AI engineer who has never shipped to production?

Substitute scope for production scars. Give the judgment-shaped work-sample, the broken-RAG or eval-scoring task, and weight how they reason about measurement and uncertainty rather than what they have shipped. A junior who builds a tiny eval harness before changing code is showing you the exact muscle you are buying. Just price the role and the risk accordingly, and lean harder on the paid trial.

Is a take-home test still worth giving in 2026?

Only if it tests judgment instead of typing. A blank-file coding round mostly measures how well someone prompts an AI tool, since around 84% of developers now use them. A take-home that asks the candidate to debug a confidently-wrong system or critique an eval set still works, because those are hard to fake with a model and they reveal how the person thinks.

How long should the paid trial be?

Long enough to see real behavior, short enough to be cheap insurance: three to ten days on a real or sanitized ticket is the usual range. You are watching for clarifying questions, measurement instinct, and honest communication when stuck. For a mission-critical full-time seat you can sometimes skip it on the strength of a strong work-sample and references, but for contract and fractional work it is the highest-value step you have.

What is the single fastest way to assess AI engineers?

Ask "how did you know it worked," then follow the thread. An engineer who answers with a metric, a set, and a failure mode has shipped and owned production AI. One who answers with an architecture diagram has not. It is not a complete vet on its own, but no other single question filters faster.

If you want the full hiring loop this fits into, the hiring AI engineers guide covers sourcing, cost, and failure modes, and The AI-Native Team goes deeper on the org around the hire. And if you would rather skip the whole gauntlet, the engineers placed through Devlyn's AI application engineer hiring have already cleared a harder version of this process on live work. Vet for judgment. Discount the rest.

Senior vs Junior AI Engineer: The Real Difference

Alpesh Nakrani — Fri, 01 May 2026 18:30:00 GMT

Senior vs junior AI engineer is no longer a question of years. It is whether they can evaluate what the model generated, not just generate it. AI widened that gap.

The real difference between a senior vs junior AI engineer was never the number of years on a resume, and it is even less so now. It is this: a senior can look at what the model just produced and tell you, fast, whether it is correct, why it is wrong when it is wrong, and what to change before it touches a customer. A junior can produce the same first draft just as quickly, but cannot yet evaluate it. That single gap, generation versus evaluation, is the whole ballgame, and AI did not close it but widened it.

Here is the short version of when each fits. Hire a senior when the cost of a confident wrong answer is high, when the surface is customer-facing, or when nobody else on the team can catch an "almost right" solution. Hire a junior when you have real senior review capacity to spare, the surface is low-stakes, and you are deliberately investing in the pipeline that turns juniors into seniors. The trap is hiring a junior because they are cheap, on a high-stakes build, with no senior watching, which is not saving money but burying risk where you will find it later, in production, in front of someone who paid you.

I have hired on both sides of this, and I have sat in the seat where a wrong answer became a customer-service problem in a physical store. So I want to walk through what actually separates the two levels, why the AI tooling made the gap bigger rather than smaller, where a junior genuinely is the right call, and the cost-versus-risk math that most hiring decks skip. This sits under my broader take on hiring AI engineers for judgment, not a resume, which is the pillar this argument hangs off.

Key takeaway: The senior vs junior AI engineer line is evaluation, not years. Seniors can judge whether model output is correct; juniors can generate it but not yet reliably judge it.
AI widened the gap, it did not close it. Generation got cheap and fast for everyone. Evaluation did not. The thing AI cannot do for you is the thing that separates the levels.
A junior is the right call when review capacity is real. Low-stakes surface, a senior actually reviewing, and a deliberate pipeline investment. Not as a cheap substitute on a high-stakes build.
The cheap salary is a partial price. The full cost of a junior includes senior review overhead plus the expected cost of a wrong answer that ships. Price the loaded number.
The ratio tilts senior. When generation is free, one senior who can evaluate is worth several producers. Hybrid teams work, but only when judgment density is high enough to catch the misses.

What actually separates a senior from a junior (it was never years)

Strip away the tenure and the title inflation and the difference comes down to one capacity: confident evaluation. A senior AI engineer reads a model output, a piece of generated code, a retrieval result, a tool-call sequence, and knows whether it is good enough to ship. Not "looks plausible." Good enough to ship, against the failure modes that actually matter for this product, with a reason they can say out loud. That is judgment, and judgment is what you are buying.

A junior is not worse at typing or slower at producing; in 2026 they are often faster, because the model does the producing and the junior is good at prompting it. What the junior lacks is the calibration to tell a plausible wrong answer from a correct one, and that gap is invisible to anyone who has not been burned by it. A generated function that compiles, runs, and returns something reasonable can still be wrong in a way that only shows up at the edge case, under load, or on the one input that matters most.

This is the same point I make about choosing models by eval, not vibe: the hard part of AI work is not generating an output, it is proving the output is correct. Seniority, in this field, is mostly the accumulated scar tissue of having shipped wrong answers and learned to see them coming. You cannot interview for years and expect to get that. You have to interview for the seeing.

The practical tell I look for: hand a candidate a generated artifact and ask "what is wrong with this, and what would you need to know before you shipped it?" A senior interrogates the constraint space, the failure modes, the inputs they have not seen. A junior, trained for throughput, usually pivots straight to how they would produce something better. Producing is not the question. The question is whether they can see the gap.

How AI widened the gap: juniors can generate, but not evaluate

For most of software's history, the junior path was simple: you produced clearly-specified work, slowly, a senior reviewed it, and over time you absorbed their judgment. Production was the bottleneck, so the junior added value by adding throughput, even unreviewed throughput, because human production was scarce. That world is gone: when a capable model can produce a working draft in seconds, throughput stops being scarce. Judgment becomes the scarce thing, and a junior, by definition, has not built it yet.

So the gap widened in two directions at once. The senior got more reach, able to evaluate and steer far more output per hour than before, while the junior's traditional contribution, raw production, got commoditized out from under them. The economics of this are not subtle, and the field has noticed. A 2025 industry survey of engineering leaders found that 54% plan to hire fewer juniors over the long term as a direct result of AI coding tools, and that 37% would rather deploy an AI tool than hire a recent graduate (LeadDev AI Impact Report 2025).

Generation got cheap and fast for everyone. Evaluation did not. AI did not close the gap between senior and junior; it commoditized exactly the half a junior was good at.

There is a second-order effect that makes this worse, not better, for teams that skimp. AI generates more code, faster, which means more "almost right" merges, more duplication, more nearly-correct solutions slipping toward production. Industry analysis through 2026 is consistent on this point: AI raises the need for senior review rather than lowering it (DistantJob, AI vs Junior Developers). The volume of output went up, and the volume of output that requires a trained eye to vet went up with it.

Put those together and you get the uncomfortable shape of the labor market. By 2026, only a small fraction of AI engineering postings target zero-to-two-year candidates; most want two-to-six years, and the new-graduate share of hires has fallen sharply from where it sat a few years ago. The machine did the junior's old job, but it cannot do the senior's job, which is to know whether the machine got it right. I have written about how this reshapes the whole org in what a team is for after the machine does the work; the seniority question here is the micro version of that macro shift.

When a junior AI engineer is the right call

I am not arguing juniors are bad hires. I am arguing they are the wrong hire in the specific situation most people hire them for, which is "we need this built and a junior is cheaper." There are real situations where a junior is exactly right, and pretending otherwise is how the field eats its own seed corn.

A junior is the right call when three things are true at once. First, the surface is low-stakes: internal tooling, a prototype, a feature where a wrong answer costs an afternoon, not a customer. Second, you have genuine senior review capacity, a senior who will actually read the work and whose time is budgeted for it, not a senior already underwater. Third, you are treating the hire as a deliberate investment in your pipeline, because juniors become seniors by doing reviewable work and getting it reviewed, and there is no future senior bench if no one ever hires a junior again.

That third point matters more than the hiring math suggests, because every junior you develop into a senior is institutional knowledge that does not walk out the door when someone resigns. The screening signal I trust for a junior in the AI era is narrow and brutal: can they explain a fifty-line AI-generated snippet line by line, including why it is correct? If they can, they are learning to evaluate, which is the thing that turns into seniority. If they cannot, you have hired prompt-and-paste, and you are paying senior salaries to catch what they miss.

None of this is the right shape for a build where the cost of being wrong is high and the senior bench is thin. On a customer-facing AI feature with money or trust on the line, a junior without dense senior review is not a discount. It is a deferred bill. For the situations where you do hire a junior, my notes on the skills that actually matter for an AI engineer are a better filter than the framework checklist most job posts lead with.

The cost-vs-risk math nobody puts on the table

Hiring decks compare salaries. A junior costs less than a senior, the line item is smaller, the decision looks obvious, but that comparison is dishonest, because salary is a partial price. The full cost of a junior on an AI build is the salary, plus the senior review overhead their work requires, plus the expected cost of the wrong answers that slip through review and reach production. Until you price all three, you are not comparing costs; you are comparing the cheapest of three numbers and pretending it is the total.

Let me sketch the loaded math, illustratively, so the shape is visible. The dollar figures below are a model, not numbers from any specific engagement.

// Illustrative loaded-cost sketch, not from a live engagement junior_salary_monthly = $9,000 // cheaper line item senior_review_overhead = $4,500 // ~30% of a senior's time vetting junior output expected_wrong_answer = $6,000 // p(ship a bad answer) x avg cleanup + trust cost junior_loaded_monthly = $19,500 // the number the deck does not show senior_salary_monthly = $16,000 // bigger line item senior_review_overhead = $1,000 // self-reviews; catches own misses expected_wrong_answer = $1,200 // far lower p(ship) on high-stakes surface senior_loaded_monthly = $18,200 // on a high-stakes build, the senior is cheaper

The point of the sketch is not the exact figures, which vary by team and surface; the point is the direction. On a low-stakes surface with cheap mistakes and slack senior review, the junior's loaded number stays low and the hire is genuinely economical. On a high-stakes surface, the expected cost of a wrong answer and the review overhead climb fast enough that the "cheaper" junior is the more expensive choice. The cost of being wrong is a real line item, and it scales with how much a wrong answer in your product actually hurts.

I learned to price this the hard way. At Devlyn we build AI into a retail experience where a wrong recommendation is not an abstraction, it is an employee fixing it in front of a customer who just wanted help picking frames. In that environment the expected-wrong-answer term dominates the math, which is why the loaded cost of an unreviewed junior is far higher than the salary suggests. The full version of this calculation, in-house versus staff aug versus agency, is in my breakdown of what an AI engineer actually costs.

Senior vs junior AI engineer: a comparison you can paste into a hiring deck

Here is the senior vs junior AI engineer comparison stripped to the dimensions that actually drive the decision. Read it as a "where does each fit," not a "senior good, junior bad," because the right answer depends on your surface and your review bench.

Dimension	Junior AI engineer	Senior AI engineer
Core strength	Generation: prompts the model, ships a working draft fast	Evaluation: knows whether the output is correct and why
The gap AI widened	Cannot yet tell a plausible wrong answer from a right one	Calibrated to see the wrong answer before it ships
Review burden they create	High: needs senior review on anything that matters	Low: catches own misses, reviews others
Best surface	Low-stakes, internal, prototype, reversible mistakes	Customer-facing, money or trust on the line
Loaded cost	Salary + review overhead + expected cost of wrong answers	Higher salary, far lower review and error cost
What you are really buying	Throughput and a pipeline investment	Judgment under production pressure
Right call when	Senior review capacity is real and stakes are low	The cost of a confident wrong answer is high

Hybrid teams, and why the ratio tilts senior

The honest answer for most teams is not "all senior" or "all junior." It is a hybrid, with a ratio that has shifted hard toward senior over the last two years. When generation was the bottleneck, you wanted several producers per reviewer, because production was the scarce input. When generation is free, that math inverts. One senior who can evaluate is now worth several producers, because the producing is the part the machine already does.

So the working ratio I see hold up is a small number of seniors who own outcomes and evaluate, supported by a junior or two who are explicitly being developed, on surfaces where their misses are cheap and reviewable. The senior is not there to write more code. The senior is there to set the spec precisely enough that the model produces the right thing, and to catch it when the model produces a plausible wrong thing instead. The junior is there to learn that, by doing reviewable work under someone who can see the gap.

When generation is free, the bottleneck is confident evaluation. One senior who can judge output is worth several who can only produce it. The ratio tilts senior because the scarce skill changed.

The failure mode of the hybrid is subtle and common. A team hires three juniors and one senior to "scale," and the senior becomes a full-time bottleneck reviewing junior output, which means they stop doing the high-value evaluation and architecture work you actually hired them for. You did not scale; you converted your most valuable person into a review queue. If the senior cannot keep up with the review load, the junior output ships unreviewed, which is the worst of both worlds: junior judgment at senior prices, with the risk buried in production where it costs the most.

The senior-only argument (and what it costs you not to)

At Devlyn the posture is explicit: senior engineers only, no juniors hidden behind AI. I want to be precise about what that does and does not mean, because it is easy to misread as snobbery. It is not a statement that junior engineers are bad. It is a statement about what an AI delivery team is actually for, and where the risk lives.

When you ship AI features, the dangerous failure is not the obvious bug. It is the plausible wrong answer, the output that looks right, passes a casual glance, and is wrong in a way that only deep expertise catches. "Juniors hidden behind AI" is the staffing model where someone who cannot see that gap is producing model output and shipping it because it looks fine; the AI makes their output look senior, but it does not make their judgment senior. The gap between a plausible wrong answer and a correct one is invisible without expertise, so hiring people who cannot see it does not reduce your risk, it just moves the risk somewhere you will not find it until a customer does.

So the senior-only delivery model is a risk decision, not a status one. Every engineer who touches the output can read it and know whether it is correct, which means there is no unreviewed layer where a confident wrong answer can slip through to a customer. That is the whole argument, and it is why if you are buying delivery rather than building a pipeline, you want the people who can evaluate, not the people who can only generate. If you need senior AI delivery without the hidden-junior risk, you can hire a senior AI application engineer at Devlyn, where senior-only is the staffing model, not a nice-to-have.

The framework behind all of this, how you actually evaluate for judgment instead of throughput, is the subject of Building an AI-Native Team: Hiring for judgment, not throughput. The interview has to change as much as the org chart does. You are not testing for production speed anymore. You are testing for the ability to specify, evaluate, and own, which is exactly the capacity that separates a senior from a junior in the first place.

Frequently asked questions

What is the real difference between a senior and junior AI engineer?

It is evaluation, not years. A senior AI engineer can look at what a model generated and reliably judge whether it is correct, why it is wrong when it is wrong, and what to change before it ships. A junior can generate the same output just as fast but cannot yet tell a plausible wrong answer from a right one. AI made generation cheap for both levels, so the evaluation gap is now the whole difference.

Do I need a senior AI engineer, or will a junior do?

Hire a senior when the cost of a confident wrong answer is high, when the work is customer-facing, or when no one else can catch an "almost right" solution. A junior is the right call when the surface is low-stakes, you have real senior review capacity, and you are deliberately investing in your pipeline. The mistake is hiring a junior because they are cheap, on a high-stakes build, with no senior reviewing.

Did AI make junior AI engineers obsolete?

No, but it commoditized their traditional contribution. Raw production was a junior's value, and the model now does that part. What survives is the development path: juniors become seniors by doing reviewable work under someone who can evaluate it. Cutting all junior hiring eats the future senior bench, which is why the answer is fewer juniors on the right surfaces, not zero.

How do I screen a junior AI engineer in 2026?

Hand them a fifty-line AI-generated snippet and ask them to explain it line by line, including why it is correct and where it could be wrong. If they can, they are learning to evaluate, which is the path to seniority. If they cannot, you have hired prompt-and-paste, and the risk they create lands on whoever reviews their work, if anyone does.

If you would rather skip the seniority gamble entirely and put a senior AI engineer on the build from day one, that is the work the Devlyn team does, senior-only, with no junior judgment hidden behind the model output.

In-House vs Outsourced AI Development: The Decision

Alpesh Nakrani — Thu, 30 Apr 2026 18:30:00 GMT

I have built in-house AI teams and delivered as the outsourced partner. Here is the framework, not the sales pitch, for choosing between them.

The in-house vs outsourced AI development decision is not really a cost question, even though every vendor will try to make it one. It is a question of which capability is your moat. If the AI is the product and the model behavior is the thing customers pay for, you build it in-house and you accept the cost and the ramp. If the AI is a feature you need to ship correctly and quickly, on a capability you do not intend to own forever, you outsource it or you augment your team, and you move on. Almost everything else is detail.

I want to be honest about my own position before I say anything else, because it changes how you should read this. I run Devlyn, which means I sell outsourced AI development. I have also spent years on the other side of the table, building in-house engineering teams and living with the consequences. So I have sat in both seats, and I have watched both decisions go badly. The most expensive mistakes I have seen were not "we outsourced when we should have built" or the reverse. They were teams that never decided which capability was theirs to own, and ended up paying in-house prices for outsourced-quality results, or outsourcing the one thing that was actually their differentiation.

This article is the framework I use when a founder asks me, off the record, what they should actually do. I will lose some business by being straight about when in-house is the right call. I would rather lose it than sell you the wrong shape of team.

Key takeaway: The choice is not cost-first. It is "is this capability my moat?" If yes, build in-house. If no, outsource or augment and keep moving.
In-house wins when AI is the product, you have a multi-year roadmap, the data is regulated, or the model behavior itself is your differentiation.
Outsourcing wins when you need capability in weeks not months, you are validating before you commit, or the scope is narrow and bounded.
Hybrid is the default, not the compromise. Internal owns context, product, and IP; the partner brings methodology and execution capacity.
IP and control are where deals actually die. Decide data residency, IP assignment, and exit terms before the first line of code, not after.

The decision, stated plainly

Strip away the TCO spreadsheets and the build vs outsource AI team think-pieces and the decision reduces to one question: is this AI capability a thing you must own to win, or a thing you must have to operate? Those are different. Ownership is about moat. Operation is about function.

If a capability is your moat, the value compounds the longer your own people work on it. They accumulate domain knowledge, they tune the model behavior against your specific data, and that advantage is hard for a competitor to copy because it lives in your team's head and your eval sets. You want that on the payroll. You do not want it walking out the door at the end of a statement of work.

If a capability is operational, the opposite is true. You want it correct, fast, and off your plate, and you do not care whose head the knowledge lives in as long as it ships. Paying to develop deep institutional expertise in something you do not intend to differentiate on is just a slower, more expensive way to get a commodity result.

The choice is not in-house vs outsourced AI development on cost. It is whether the capability is a thing you must own to win, or a thing you must have to operate.

Most teams skip this step. They jump straight to comparing day rates against salaries, which is the wrong frame, because it answers "which is cheaper this quarter" when the real question is "which builds the asset I am trying to build." Get the moat question right first. The cost math is downstream of it.

In-house vs outsourced AI development: the real tradeoffs

There are four axes that actually move this decision. Cost, speed, control, and talent access. Every honest comparison lives in how these four trade against each other, so let me take them one at a time without the spin.

Cost. In-house is a high fixed cost. A small AI team of two or three senior people runs somewhere in the range of $500K to $1.5M a year once you count loaded salaries, recruiting, tooling, and the productivity you lose during ramp. Outsourcing converts that into a variable cost you can turn up or down, which matters enormously when you are not yet sure the use case will pay off.

Speed. This is where the gap is widest and least appreciated. If you need AI in production in eight weeks, in-house is simply not on the table, because you cannot recruit, hire, and ramp a senior engineer in that window. A capable partner can be producing in two to four weeks because the team already exists and already knows how to ship this kind of work.

Control. In-house gives you full control over architecture, priorities, IP, and the data path. Outsourcing trades some of that control for speed and flexibility. How much you trade depends entirely on the engagement model, which is why the staff-augmentation middle exists, and I will get to it.

Talent access. This is the one founders underestimate in 2026. There is a genuine, severe shortage of people who can build production AI. PwC's 2025 Global AI Jobs Barometer found a 56% wage premium for AI skills, more than double the prior year, and Robert Half's 2026 hiring data shows AI and machine-learning roles among the hardest to fill. A large majority of employers report they cannot fill AI roles. Outsourcing is, in part, a way to rent access to a talent pool you cannot reliably hire from on your own timeline.

When in-house AI development wins

Building in-house is the right call more often than a partner like me will admit in a sales call. Here is when I tell people to build, even though it costs me the deal.

Build in-house when the AI is the product, not a feature. If your company's entire reason to exist is the quality of a model's behavior, that capability cannot live outside your walls. The compounding domain knowledge is the business. Outsourcing it would be like a restaurant outsourcing its kitchen.

Build in-house when you have a genuine multi-year AI roadmap. The fixed-cost math that looks ugly in year one looks very different over five years if the capability is central and continuously evolving. A standing team that deepens its understanding of your domain every quarter beats a series of project engagements that each start cold.

Build in-house when the data is regulated or the IP is the moat. If your training data cannot legally leave your perimeter, or if the model weights and eval sets you develop are themselves the competitive asset, the control premium is worth paying. Some advantages only exist if nobody outside the company ever touches them.

I worked with a healthtech company that agonized over this. Their model reasoned over protected patient data, and that reasoning quality was their entire pitch to hospital buyers. We could have staffed it faster as an outsourced build, and I told them not to. They built in-house, ate a slow nine-month ramp, and two years later that decision was unambiguously right because the capability had become impossible for a competitor to replicate without their accumulated data and tuning.

When outsourcing AI development wins

Now the other side, which is the side I am obviously biased toward, so apply a discount to everything that follows.

Outsource AI development when speed is the binding constraint. If the window to ship is measured in weeks and the cost of being late is real, you cannot wait out a hiring cycle. A senior in-house hire takes four to six months to recruit and another three to six months to ramp to full output, which means six to nine months before they ship something that matters. A standing partner team ships in weeks.

Outsource when you are validating before you commit. Plenty of AI initiatives should not exist, and the cheapest way to find that out is to build the thing quickly with a partner, put it in front of users, and see whether it earns its keep before you commit to a million-dollar standing team. Outsourcing is a fantastic de-risking instrument for the "should we even do this" question.

Outsource when the scope is narrow and bounded. A well-defined AI feature with a clear spec and a clear finish line is ideal outsourcing work. You are not trying to own a capability forever; you are trying to get one thing built correctly and integrated cleanly.

A senior in-house hire takes six to nine months before they ship something that matters. A standing partner team ships in weeks. Speed is the most underpriced variable in this decision.

One pattern I see constantly: a Series A company tries to hire its first AI engineer, spends five months failing to close a candidate in a bidding war it cannot win, and burns the runway that the feature was supposed to protect. They would have been far better served renting a senior team for the first two quarters and hiring later, from a position of a working product rather than a blank repo.

The hybrid and staff-augmentation middle (the real default)

Here is the thing most of these comparisons get wrong: they frame it as a binary, in-house or outsourced, when the model that actually wins most often is neither. It is hybrid, and hybrid is not a compromise. It is frequently the correct answer outright.

The structure that works looks like this. Your internal people own the things that must stay internal: product direction, domain context, the architecture decisions that lock in your IP, and the final call on what good looks like. The partner brings execution capacity and methodology, the people who have shipped this kind of system before and will not relearn the lessons on your budget.

There is a meaningful distinction inside "hybrid" worth naming. Staff augmentation is when you rent individual engineers who plug into your team and follow your priorities day to day; you keep the steering wheel. Managed outsourcing is when you hand a partner an outcome with a defined service level and let them own the how. Augmentation is right when you want to own the architecture and direction. Managed delivery is right when you want to outsource a result, not a process.

The reason hybrid is the default for mid-sized efforts, roughly the $500K to $1.5M range where neither pure model is clearly correct, is that it lets you keep the moat in-house while renting the capacity and the methodology. You are not choosing between control and speed. You are buying the specific amounts of each that your situation needs. For more on how the underlying team shape is changing as AI absorbs the production work, I wrote about that in what a team is for after the machine does the work.

IP and control: where deals actually die

If a build-vs-outsource decision blows up, it is usually not about cost or speed. It is about IP and control, and it blows up because nobody nailed those terms down before the work started.

Three things have to be settled in writing before the first line of code. First, IP assignment: who owns the code, the model artifacts, the eval sets, and the fine-tunes when the engagement ends. The default in a good contract is that everything created for you belongs to you, but defaults vary and ambiguity here is poison. Second, data residency and access: where your data lives, who can see it, whether it ever transits a third-party API, and what happens to it at the end. In regulated industries this single issue kills more deals than price ever does.

Third, exit and continuity: what happens when the partner leaves. Outsourcing creates a real risk that critical knowledge walks out the door at the end of the statement of work. The way you de-risk it is by requiring documentation, handover, and ideally an internal owner who shadows the work from day one, so the partner is transferring capability rather than hoarding it. I have watched companies discover, the day a vendor offboarded, that nobody internal understood the system they were now responsible for. That is an avoidable disaster, and it is on both parties to avoid it.

The honest version of the control tradeoff is this: in-house gives you maximum control by default but no speed, and outsourcing gives you speed but only as much control as your contract preserves. Good contracts preserve a lot. Lazy ones preserve almost nothing. The control you lose to outsourcing is mostly the control you failed to write down.

In-house vs outsourced AI development cost and speed: the honest math

Let me put numbers on this, with a heavy caveat: these are illustrative ranges from widely-cited 2026 estimates, not a quote, and your real figures depend on your market, seniority mix, and scope.

// Illustrative 3-year total cost of ownership, not a quote // Source: one widely-cited 2026 vendor estimate, rounded in_house_team = 2 to 3 senior engineers in_house_3yr_TCO = $2.8M to $5.1M // salary + recruiting + tooling + ramp loss in_house_first_ship = 6 to 9 months // 4-6 mo hire + 3-6 mo ramp outsourced_3yr_TCO = $0.3M to $0.45M // equivalent scoped output outsourced_first = 2 to 4 weeks // Reported AI-engineer voluntary attrition ~38%/yr vs ~13% traditional // Every departure resets ramp and re-opens a hard-to-fill role

Treat the headline multiple with suspicion: a single vendor reporting that in-house is six to seventeen times more expensive is exactly the kind of number a vendor would report, and I say that as a vendor. The directional truth holds anyway. In-house carries a large fixed cost and a long delay to first output; outsourcing carries a lower variable cost and near-immediate output. What the simple comparison hides is the compounding value of an in-house team on a capability that is genuinely yours, which never shows up in a three-year spreadsheet but is the entire reason to build. For a deeper breakdown of the staffing economics specifically, see what an AI engineer actually costs.

The attrition point deserves its own line. AI engineers are reported to leave at roughly triple the rate of traditional engineers, and every departure on a small in-house team resets the ramp clock and re-opens a role that takes months to fill. That volatility is a real, recurring cost of the in-house path that the salary line item never captures.

A decision checklist you can run in 20 minutes

If you want to make this call quickly and defensibly, answer these. The pattern of your answers points clearly at in-house, outsourced, or hybrid.

Is the AI the product or a feature? Product points to in-house; feature points to outsourced or hybrid.
What is your window to first production output? Under three months effectively rules out a pure in-house build.
Is the capability your moat, or table stakes? Moat stays in-house; table stakes can be rented.
Can your data legally and safely leave your perimeter? If not, in-house or a tightly-scoped partner with strict residency terms.
Do you have a multi-year roadmap for this capability, or a one-time need? Roadmap favors building; one-time favors outsourcing.
Can you actually hire the talent on your timeline? In a market where most employers cannot fill AI roles, be honest here.
Have you validated the use case, or are you guessing? Unvalidated points to a fast outsourced pilot before any standing-team commitment.

If most answers point one direction, you have your answer. If they split, you are a hybrid case, which is the most common outcome and nothing to apologize for. This is exactly the decision a readiness assessment is built to run, before you commit any implementation budget.

In-house vs outsourced vs hybrid, side by side

Dimension	In-house	Outsourced	Hybrid / staff-aug
Cost shape	High fixed; $500K–$1.5M/yr small team	Variable; scoped, turn up or down	Mixed; internal core + rented capacity
Time to first output	6–9 months (hire + ramp)	2–4 weeks	Weeks for capacity; internal ramps in parallel
Control over IP/architecture	Maximum by default	As much as the contract preserves	High; internal owns moat, partner executes
Key risk	Attrition, slow ramp, hiring failure	Lost context, vendor lock-in, IP ambiguity	Coordination overhead; needs strong internal owner
Best when	AI is the product; regulated; multi-year moat	Speed; validation; narrow bounded scope	$500K–$1.5M; moat in-house, capacity rented

The table makes the hybrid column look like the safe middle, and for a lot of teams it genuinely is. But "safe middle" only works if you have a strong internal owner to hold the steering wheel. Without one, hybrid degrades into expensive outsourcing with extra meetings.

Frequently asked questions

Is in-house or outsourced AI development cheaper?

Outsourcing is almost always cheaper in the first one to three years, because you skip the fixed costs of recruiting, salaries, tooling, and ramp. Widely-cited 2026 estimates put the three-year gap at several times, though those come from vendors and deserve a discount. The catch is that pure cost is the wrong lens: if the capability is your moat, the compounding value of an in-house team is the entire point and never appears in a cost comparison.

How long does it take to build an in-house AI team versus outsourcing?

A senior in-house hire typically takes four to six months to recruit and another three to six months to ramp, so six to nine months to meaningful output. A standing partner team usually ships in two to four weeks because it already exists and knows how to ship this kind of work. If your window is under a quarter, in-house is effectively off the table.

What is the hybrid model for AI development?

Hybrid keeps your internal people owning product direction, domain context, and the IP-defining architecture, while a partner provides execution capacity and proven methodology. Staff augmentation rents individual engineers who follow your priorities; managed delivery hands a partner an outcome with a service level. It is the default for mid-sized efforts because it keeps the moat in-house while renting the speed.

How do I protect my IP when outsourcing AI development?

Settle three things in writing before any code is written: IP assignment (everything created for you belongs to you), data residency (where your data lives, who sees it, whether it leaves your perimeter), and exit terms (documentation, handover, and an internal owner who shadows the work). Most outsourcing disputes are not about cost; they are about control nobody bothered to write down.

If you are working through this decision and want a straight read on which model fits your situation rather than a sales pitch, a Devlyn readiness assessment maps your use cases, data, and timeline to the right staffing shape before you spend implementation budget. And if the answer is "rent the capacity," the team we place is built for exactly the outsourced and staff-augmentation paths above. This article sits under my fuller guide to hiring AI engineers; for the longer argument about where human value concentrates when execution commoditizes, see the judgment economy and the framework in Building an AI-Native Team.

Staff Augmentation vs Consulting: Who Owns the Outcome

Alpesh Nakrani — Wed, 29 Apr 2026 18:30:00 GMT

Staff augmentation vs consulting comes down to one question: who owns the outcome. Here is when each fits, what it really costs, and how to choose for AI work.

The difference between staff augmentation vs consulting is not really about cost, headcount, or contract length, even though that is how most comparisons frame it. It is about one thing: who owns the outcome. With staff augmentation, you rent capacity, the people embed in your team, follow your direction, and you stay accountable for whether the work lands. With consulting or managed services, you buy an outcome, the firm brings its own method and team, decides how to deliver, and carries a real share of the delivery risk.

So the short version is this. Staff augmentation fits when you know what to build and need hands to build it. Consulting fits when the problem is still fuzzy, or when you want the risk off your own plate.

I want to be honest about my own position before I go further, because it should change how you read this. I run Devlyn, and we sell both models, staff augmentation when a client needs senior engineers inside their team, and consulting and delivery when they need someone to own a scope end to end. So I do not have a horse in this race the way a pure staffing firm or a pure consultancy does. I have sat in both seats, sold both, and watched both go wrong. The expensive mistakes I see are almost never "they picked the wrong one." They are teams that bought the contract shape that felt comfortable rather than the one that matched who could actually own the result.

Key takeaway: The decision axis is ownership, not price. Staff augmentation keeps the outcome yours; consulting and managed services transfer part of it to the vendor.
Control and accountability are the same lever. The more control you keep, the more risk you keep. You cannot hand off the risk and also dictate the how.
"Cheaper" is a trap. Staff augmentation has a lower headline rate but you absorb the delivery risk; consulting costs more partly because the firm prices that risk in.
For AI work, the cut is clean. Architecture set and data ready: augment. Still deciding what to build: consult. Most teams need a little of both, in that order.
The hybrid is the normal answer. Scope it with advisory, then build it with embedded engineers, is the most common real-world shape and nothing to apologize for.

Staff augmentation vs consulting, defined without the fog

Strip the marketing language off both terms and they are simple. Staff augmentation means you bring external people into your organization to extend a team you already run. They work inside your workflows, attend your standups, use your tools, and report to your managers. You decide what they build and in what order. You keep the visibility and the steering wheel, and you keep accountability for whether the project succeeds.

Consulting means you hire a firm to deliver a result, not to fill seats. The firm owns the process. It chooses the approach, selects the tools, staffs the work as it sees fit, and is held to a defined scope or set of milestones. Your job shifts from directing the work to defining what you need and signing off on whether it was delivered. Knowledge transfer is usually part of the deal, because the firm is not a permanent fixture in your org.

The reason these blur together in practice is that both put outside people on your problem, and the invoices can look similar. But the operative difference is direction and ownership. In one model you are the manager and the firm is supplying labor. In the other the firm is the owner and you are the client. Everything else, the rates, the contract, the reporting, follows from that one distinction.

In one model you are the manager and the firm is supplying labor. In the other the firm is the owner and you are the client. Everything else follows from that.

Who owns the outcome (and who carries the risk)

This is the section that actually decides the question, so I want to be precise. Ownership and risk are the same lever pulled from two ends. When you augment your staff, you assume most of the delivery risk because you control execution. If the project ships late or wrong, that is on your management, not the people you rented. When you hire a consultancy, you transfer a portion of that risk to the firm, because the firm controls execution and is accountable for the result.

That transfer is not free, and it is not total. A consultancy that is accountable for an outcome it does not fully control will price a risk premium into the engagement, the published figures I have seen put vendor risk premiums somewhere in the range of 20 to 50 percent over a pure cost estimate, and that matches what I have observed pricing delivery work myself. You are paying that premium to move the downside onto someone else's balance sheet. Whether that is worth it depends entirely on whether you actually have someone internally who could own the outcome instead.

Here is the trap I watch teams fall into: they want the low day rate of staff augmentation and the hands-off accountability of consulting at the same time, and that combination does not exist. If you want to keep control of the how, you keep the risk; if you want the risk off your plate, you give up control of the how. Trying to have both produces the worst version of either, you micromanage a consultancy you are paying to own the work, or you abdicate to augmented engineers who were never set up to own it.

I sat with a founder last year who had hired three contract engineers, then got frustrated that the AI feature they were building kept missing the mark. He wanted to blame the engineers, but there was no internal product owner writing specs, defining what "good" looked like, or evaluating output against intent. He had bought capacity and expected outcome ownership to arrive with it, which it never does. We restructured the engagement so that one senior person actually owned the spec and the evaluation loop, the same engineers started shipping the right thing, and it was clear the model had never been the problem, the missing owner was.

The cost and control tradeoff nobody prices honestly

The honest cost comparison is not "staff augmentation is cheaper." It is "staff augmentation has a lower headline rate and a higher hidden cost, and consulting has a higher headline rate and a lower hidden cost." The hidden cost on the augmentation side is your own management time, your delivery risk, and the senior judgment you have to supply to steer the work. The hidden cost on the consulting side is the risk premium and the reduced control.

Staff augmentation is typically priced per person, an hourly or monthly rate that gives you a predictable run-rate flexing with headcount. You can scale up for a sprint and scale down after, and you can model the cost on a spreadsheet because it is just rate times people times months. This is essentially a time-and-materials structure, and time-and-materials buys you adaptability and fast learning at the price of carrying the budget risk yourself.

Consulting and delivery work is usually priced fixed-fee, on retainer, or increasingly on outcomes, tied to a measurable result rather than hours. Fixed-fee buys predictability, but only when the requirements are genuinely stable, because the firm prices uncertainty as a premium and will defend its scope hard the moment things move. Outcome-based pricing, where the fee tracks a business result like cost-to-serve or delivery velocity, is showing up more in AI engagements specifically, because AI work reduces manual effort over time in ways that hours-based billing captures badly.

The control dimension mirrors the cost one. Augmentation gives you maximum control and maximum responsibility: you see everything, direct everything, and own everything that goes wrong. Consulting gives you less granular control and less day-to-day responsibility: you define the destination and grade the arrival, but you do not pick the route. Neither is better, they are different trades, and the right one depends on whether your scarce resource is money, time, control, or senior judgment. I worked through the staffing economics in what an AI engineer actually costs, worth reading alongside this if cost is your binding constraint.

Where managed services sit on the line

People ask me about staff aug vs managed services as if it is a third, separate question, but managed services live on the same axis, just further toward the "vendor owns it" end. Where project consulting delivers a defined result and then hands off, managed services take ongoing ownership of an entire function against a service-level agreement. You define what needs to be delivered and to what standard; the provider owns the people, the process, and the results indefinitely, or for as long as the contract runs.

So the spectrum runs cleanly from one end to the other. Staff augmentation is people-based, you own everything. Consulting is outcome-based and time-bounded, the firm owns a defined deliverable. Managed services is outcome-based and ongoing, the provider owns a whole function with guarantees attached. Cost generally climbs as you move down that line, because each step transfers more risk and more responsibility to the vendor, and the vendor charges for carrying it.

For AI specifically, managed services make sense for the parts that are operational rather than differentiating, monitoring a model in production, retraining on a cadence, keeping an inference pipeline healthy. You would not usually hand your core product judgment to a managed-services contract, but the plumbing that has to run reliably and is not your competitive edge is a reasonable thing to offload, with an SLA that makes someone else accountable for uptime.

When staff augmentation wins

Staff augmentation wins when you know what you are building and the constraint is hands, not direction. If the architecture is decided, the data pipelines exist, and you need a senior engineer to build and ship a specific thing, you do not want to pay a consultancy to re-examine your strategy. You want capacity that slots into a plan you already trust.

It also wins when you have the senior judgment internally to steer and evaluate. The whole model depends on someone on your side who can write a clear spec, recognize when the output is wrong, and own the result. If you have that person, augmentation is the efficient choice, you are buying execution and supplying the judgment yourself, which is the cheaper half to buy.

And it wins for capacity gaps with a known shape: a launch deadline, a seasonal spike, a six-month build where you need three more engineers than you have. Augmentation flexes up and down without the fixed cost and the long ramp of hiring full-time, and without the accountability handoff of consulting that you do not need when the plan is already clear.

Staff augmentation wins when the constraint is hands, not direction. You have the plan and the judgment; you are buying execution and supplying the steering yourself.

When consulting or managed services wins

Consulting wins when the problem is still ambiguous. If you cannot yet write the spec, if "build us an AI feature" is the clearest statement anyone can make, then handing someone capacity is premature. You will get motion without direction. What you need first is someone to own the discovery, figure out what to build, and only then build it, and that ownership is exactly what consulting is for.

It wins when you have no internal owner with the seniority to steer the work. This is the most underrated trigger. If nobody on your team can confidently evaluate whether AI output is correct, augmenting your staff just buries the risk, because the people you rented were never set up to own it and you cannot grade what they produce. In that situation, paying a firm to own the outcome, and to transfer knowledge as it goes, is the responsible move even though it costs more.

And it wins when you genuinely want the delivery risk off your plate and are willing to pay the premium for it. A regulated rollout, a board-level commitment with a hard deadline, a domain where being wrong is expensive, these are cases where moving the downside onto a firm that is contractually accountable is worth more than the rate difference. You are buying a guarantee, not just labor. I argued the parallel build-versus-buy decision in in-house vs outsourced AI, which is the question you should settle before this one.

The hybrid most teams actually need

In practice the answer is rarely pure. The most common shape that works is sequential: use consulting or a readiness assessment to scope the problem and decide what to build, then use staff augmentation to build it once the plan is solid. You buy outcome ownership for the ambiguous front end, where you most need someone accountable for direction, and you switch to capacity for the execution phase, where you have a plan and just need hands.

I have run this exact pattern with clients who came in saying "we need engineers" when what they actually needed was two weeks of someone owning the question of what those engineers should build. We scoped it as advisory, produced a plan they trusted, and then placed the engineers to execute it. The reverse also happens, augmentation teams that hit a wall they cannot architect their way past, and we drop in a short consulting engagement to unblock the design before handing the keyboard back. The point is that these are not rival religions. They are tools for different phases of the same project, and good operators move between them without ceremony.

How to choose for AI work specifically (the staff augmentation vs consulting decision)

AI work sharpens the staff augmentation vs consulting decision because the failure mode of getting it wrong is worse. With ordinary software, capacity without direction produces slow, mediocre output you can see and fix. With AI, it produces confident, plausible output that is wrong in ways that are invisible without deep expertise, which is exactly the gap I keep returning to in this work. So the question of who can actually evaluate the output is not a nicety. It is the deciding factor.

Here are the questions I walk leaders through. Is your architecture and data foundation already set, or are you still deciding the approach? Do you have someone internally who can read AI output and know immediately whether it is correct? Is the problem well-specified, or is "build AI into the product" still the clearest sentence anyone can write? Do you want to keep control of the build, or get the delivery risk off your plate?

If the architecture is set, you have an internal owner who can evaluate, and the spec is clear, augment. You are buying hands for a plan you trust. If the problem is fuzzy, you have no one who can confidently grade AI output, or you need the risk transferred, start with consulting, and possibly move to augmentation once the plan firms up. This article sits under my fuller guide to hiring AI engineers, which covers the seniority bar and the evaluation problem in depth; the underlying argument about why judgment is the scarce input is in Building an AI-Native Team.

Staff augmentation vs consulting vs managed services, side by side

Dimension	Staff augmentation	Consulting / project	Managed services
Who owns the outcome	You	The firm, for a defined scope	The provider, ongoing
Who carries delivery risk	You	Shared, firm carries a share	Provider, against an SLA
Who directs the work	Your managers	The firm	The provider
Pricing model	Per person, time & materials	Fixed-fee, retainer, or outcome	Retainer or outcome, with SLA
Headline cost	Lower	Higher	Highest
Hidden cost	Your management time and risk	Risk premium, less control	Less control of a whole function
Best when	Scope clear, you have an owner	Problem ambiguous, no owner	Ongoing non-core function
Flexibility	Highest, scales with headcount	Bounded by scope	Bounded by contract term

If you are weighing these models for an AI build and want a straight read rather than a sales pitch, a Devlyn readiness assessment maps your use cases, data, and timeline to the right shape before you commit any implementation budget. And if you already know what to build and just need senior hands inside your team, the engineers we place are set up for exactly the staff-augmentation path above.

Frequently asked questions

What is the main difference between staff augmentation and consulting?

Ownership of the outcome. Staff augmentation rents you capacity, the people embed in your team, follow your direction, and you stay accountable for the result. Consulting buys you an outcome, the firm owns the method and the delivery and carries a share of the risk. Everything else, the rates, the contracts, the reporting, follows from that one distinction about who is steering and who is accountable.

Is staff augmentation cheaper than consulting?

It has a lower headline rate, but "cheaper" is misleading. With staff augmentation you absorb the delivery risk and supply the senior judgment to steer the work, which are real costs that do not show up on the invoice. Consulting costs more partly because the firm prices that risk into the fee. The cheaper option is whichever matches a resource you already have, internal judgment makes augmentation cheap; the lack of it makes consulting worth the premium.

What is the difference between staff aug vs managed services?

They sit on the same spectrum at opposite ends. Staff augmentation is people-based, you own the work and direct it. Managed services is outcome-based and ongoing, the provider owns an entire function against a service-level agreement and is accountable for the results indefinitely. Consulting sits between them: outcome-based like managed services, but time-bounded to a defined deliverable rather than running continuously.

Which model is better for an AI project?

It depends on whether you know what to build and who can evaluate the output. If your architecture is set, your data is ready, and you have someone who can tell a correct AI output from a confidently wrong one, staff augmentation is efficient. If the problem is still ambiguous or no one internally can grade the output, start with consulting to scope it, then augment to build. The hybrid, consult to scope and augment to build, is the most common real answer.

If you are working through this decision for a specific AI build, a readiness assessment is built to run exactly that mapping, use cases, data, and timeline to the right staffing shape, before you spend a dollar on implementation.

AI Team Structure: The Roles You Need in 2026

Alpesh Nakrani — Tue, 28 Apr 2026 18:30:00 GMT

The roles an AI team needs have not changed much. What changed is the shape: fewer people, more senior, and a real evaluation function at the center.

The right AI team structure in 2026 is a small core of senior people, each owning a wide slice of the problem: an AI application engineer who ships features on top of models, a data engineer who feeds them, someone on platform and MLOps so the thing runs, and a dedicated evaluation owner whose whole job is to know whether the output is correct. Above that sits a product owner who can write a spec precise enough to steer a model. Most teams need fewer of these people than they think, and they need them more senior than they want to pay for.

I have built and deployed more than 80 senior AI engineers into teams at Devlyn, and I sit in two seats while I do it: I read the traces and I read the P&L. That combination is the reason I am skeptical of most org charts I get handed. They are drawn for a production constraint that no longer exists. They assume the hard part is generating the artifact, so they staff up to generate more of it. The hard part is no longer generation. It is judgment, and the team you build should be shaped around that.

This piece is a supporting guide under my pillar guide to hiring AI engineers. There I cover what good looks like and how the bad hires fail. Here I want to answer a narrower, more practical question: which roles do you actually need on an AI team, who owns what, how big should the team be at your stage, and how is all of this different now that AI does so much of the work that used to require headcount.

The role list is short. Application/AI engineer, ML engineer, data engineer, MLOps/platform, an evaluation owner, and a product owner. Most teams do not need all six on day one.
The shape changed, not the roles. Fewer people, each more senior, owning more surface area. Industry placement data through 2026 shows junior demand down sharply while ML and LLM engineer demand climbs.
Evaluation is now a seat, not a chore. Someone has to own whether the output is correct. If nobody owns it, everybody assumes someone else does, and you ship the confident wrong answer.
Size by stage, not by ambition. Two or three before product-market fit, five to ten in growth, structure only when coordination cost forces it.
The anti-patterns are predictable. Juniors hidden behind AI, MLOps hired before there is a model, and a team measured on throughput instead of outcomes.

If you are standing up an AI team right now and would rather buy pre-vetted senior judgment than run a three-month search, this is exactly the work my team does. You can hire an AI application engineer through Devlyn and skip the part where you gamble on candidates you cannot fully evaluate yourself.

The core roles an AI team needs, and what each one owns

Let me name the roles plainly, because the AI team structure debate gets cluttered with invented titles that do not map to ownership. There are six functions worth naming. Whether they are six people or two people wearing three hats each depends entirely on your stage.

AI application engineer. This is the person who turns a model into a feature a customer can use without someone standing behind it apologizing. They own the integration: prompts, retrieval, structured outputs, tool calls, the product surface, permissions, and the observability that tells you when it breaks. For most companies adding AI to an existing product, this is the first and most important hire, and it is the role I get asked to fill more than any other.

ML engineer. They own the model itself: fine-tuning, training pipelines, the modeling decisions when an off-the-shelf model is not enough. If your AI is a layer on top of someone else's models, you may not need this role for a long time. If the model is your product, this is where you start.

Data engineer. They own the pipelines that feed everything: ingestion, transformation, the retrieval store, data quality. AI features fail on bad data far more often than on bad models, and this role is chronically under-hired because it is invisible until it is the bottleneck.

MLOps / platform engineer. They own deployment, monitoring, inference cost, scaling, and the boring reliability work that decides whether the feature survives contact with real traffic. Critically, you hire this role when reliability is your constraint, not before.

Evaluation owner. Someone whose job is to know whether the output is correct, by failure mode and severity, and to own the gate between "it generated something" and "we ship it." This used to be a chore distributed across everyone, which meant nobody did it. It is now a seat.

AI product owner. The person who can write a specification precise enough to constrain what the model produces, and who can tell you what good looks like before the work starts. When generation is cheap, the spec is the lever, and a vague spec produces plausible work that is quietly wrong.

The roles table: who you need, what they own, when

Here is the same thing in a form you can paste into a planning doc. The "when you need it" column is the one people skip, and it is the one that saves you the most money.

Role	What it owns	When you need it
AI application engineer	Model-to-feature integration, prompts, retrieval, product surface, observability	First hire when adding AI to a product
ML engineer	The model: fine-tuning, training, modeling decisions	When the model is the product, or off-the-shelf stops being enough
Data engineer	Pipelines, retrieval store, data quality	Early, the moment data volume or quality becomes the bottleneck
MLOps / platform engineer	Deployment, monitoring, inference cost, scaling	When reliability and cost at traffic are the constraint, not before
Evaluation owner	Whether output is correct; the ship/no-ship gate	The day real users see model output
AI product owner	The spec, success criteria, what good looks like	From day one, even if part-time on a founder

How AI changes the shape: fewer people, more senior

The roles above would have been recognizable five years ago. What has genuinely changed is the headcount math, and the direction of the change is consistent everywhere I look. When a capable model can produce a first draft, a working prototype, or a test suite in seconds, one senior person covers far more ground and the number of bodies you need to wrap around the work contracts.

The labor data points the same way. Industry placement data reported through 2025 and 2026 shows junior developer demand down roughly 40%, while demand for ML engineers climbed sharply and a brand-new LLM engineer role appeared almost from nothing. One staffing firm's own placement mix shifted from roughly 60% mid-level and 30% senior in 2022 to 25% mid-level, 65% senior, and 10% AI specialist by 2026. The shape of that shift matters more than the exact percentages: the team is smaller and it is tilted hard toward senior.

The team is smaller and it is tilted hard toward senior. One senior who can architect and evaluate is worth three production-oriented juniors when the output you need is output you can trust.

I run this as a deliberate posture, not an accident of the market. Senior engineers only, no juniors hidden behind AI. That is not a judgment about junior engineers as people. It is a statement about what the work now requires: someone who can read model output and know immediately whether it is correct. The gap between a plausible wrong answer and a right one is invisible without deep expertise, and hiring people who cannot see that gap does not reduce risk, it buries it. I make the full version of this argument in my piece on senior versus junior AI engineers.

There is supporting evidence in the broader labor market too. A PwC study of over a billion job postings found that AI-exposed entry-level jobs in the US are now several times more likely to demand traditionally senior skills, things like judgment and leadership, than the same roles were in 2019. The work that used to be a starting rung is being pulled upward into territory that assumes you already have judgment. That is the same compression I see inside teams.

The evaluation function is now a real role, not a side task

If you take one structural idea from this piece, take this one. On an AI team, someone has to own evaluation, and it cannot be a thing everyone does in the cracks between their real work.

Here is why it became a seat. When humans did the generation, the quality check was baked into the act of producing. The engineer who wrote the code understood the code. When a model generates the code, that understanding is no longer automatic. You can ship something nobody on the team actually verified, and it will look completely fine right up until a customer hits the edge case you never evaluated.

The evaluation owner builds and maintains the eval suite: a held-out set of representative inputs, labeled with the outputs you want, with errors categorized by type and severity. They own the gate. They are the reason you can answer "how do you know this is good enough to ship" with data instead of a shrug. I have watched the bottleneck on shipping speed turn out to be confident evaluation far more often than generation, and a team with a real evaluator loops less and ships faster. This connects directly to the discipline I describe in human-in-the-loop evaluation, where "a human reviews it" is not a plan unless someone owns the review.

One pattern from the field, NDA-safe and composited from teams I have worked with: a company added an AI feature, declared it done after a two-hour vibe check, and shipped. It worked for three weeks. Then a class of inputs nobody had evaluated produced confidently wrong answers in front of paying users, and a human had to clean up each one in real time. The fix was not a bigger model. It was hiring one person to own the eval suite and the gate. The loops stopped.

How big should an AI team be, by stage

The honest answer is smaller than you want it to be, and the right number is set by your constraint, not your ambition. Here is the shape by stage.

Before product-market fit, two to three people. An AI application engineer and a product owner is a real team, and the founder is often the product owner. You are not optimizing for throughput yet. You are trying to find out whether the thing works at all, and a small senior team finds that out faster than a large junior one.

In growth, five to ten people. Now you add a data engineer because data quality is your bottleneck, an evaluation owner because real users are seeing output, and an MLOps or platform engineer once reliability and inference cost start to bite. An ML engineer enters here only if the model itself is a differentiator. A widely cited Gartner forecast holds that by 2030 the great majority of engineering teams will be smaller, AI-augmented units, and growth-stage AI teams are where that future is already visible.

At scale, you add structure, not just people. This is where reporting lines and spans of control actually start to matter, and where the temptation to hire to a headcount target is most dangerous. Resist adding people faster than you add the senior judgment to direct them.

On ratios, the one I watch most is senior-to-junior, and I keep it heavily senior for the reasons above. The second is engineer-to-evaluator: you do not need a one-to-one, but you need at least one person whose primary accountability is evaluation before you have more than a couple of engineers generating output. The third, easy to forget, is that a data engineer often unblocks more value than the next model hire, because the failures are usually upstream of the model.

Reporting lines and ownership: flatter, with the outcome owned end-to-end

The org chart for an AI team should be flatter than the one you would have drawn for a same-output software team, because the coordination middle thins when the work is more self-directing. Fewer layers sit between the person setting intent and the output. Each person owns more surface area and gets more done per hour. I unpack the macro version of this in what a team is for after the machine does the work.

The principle I hold to is ownership over hours, outcomes over velocity. I am not measuring presence or pace. I am measuring whether the outcome was good and whether this person drove it. That selects for people who actually want to own things, which is a different population than people who are good at looking busy.

On where evaluation reports: keep it independent enough that the person who owns the gate is not the same person racing to ship through it. On a small team that can be the same human wearing the discipline of two hats, but the moment you can afford to separate generation from evaluation in the reporting line, do it. The whole point of the evaluation seat is that it can say no.

The whole point of the evaluation seat is that it can say no. Keep it independent enough that the person who owns the gate is not the same person racing to ship through it.

The anti-patterns I see most often

Most broken AI teams are broken in the same handful of ways. Here are the ones I run into most, and what each one costs.

Juniors hidden behind AI. A team staffs up with junior engineers on the theory that AI makes everyone senior. It does not. It makes the gap between a plausible wrong answer and a right one harder to see, which is exactly the gap juniors are still learning to see. You do not save money here. You defer the cost to production, where it is more expensive.

MLOps before there is a model. Teams hire platform and MLOps people early because it feels rigorous. But if you have not shipped anything to real traffic, there is nothing to operate. You end up with sophisticated infrastructure around a feature that has not proven it should exist. Hire MLOps when reliability is the constraint.

No evaluation owner. Covered above, but it earns its place on the anti-pattern list because it is the most common and the most expensive. If nobody owns whether the output is correct, everyone assumes someone else does, and the confident wrong answer ships.

Hiring for throughput. Job descriptions written for production speed, interviews that test how fast someone can produce, performance reviews counting tickets closed. All of it selects for the skill that AI has made cheap and ignores the skill that is now scarce, which is judgment. I cover what to test for instead in my guide to the skills that actually matter for AI engineers.

One more composited story from the field. A team I advised was convinced it needed to double its engineering headcount to ship its AI roadmap. When we mapped who owned what, the real gap was not generation capacity, it was a single missing evaluation owner and a data engineer to fix the retrieval store. They shipped the roadmap with the team they already had, plus those two hires, and the headcount they almost added would have made coordination worse, not output better.

If you want the full framework for shaping a team around judgment rather than throughput, including how to interview for it, I wrote Building an AI-Native Team for exactly that. And if you would rather not run the search yourself, my team places pre-vetted senior engineers into this exact shape, you can hire an AI application engineer and start with judgment already in the room.

Frequently asked questions

What roles does an AI team need? At a minimum: an AI application engineer who ships features on top of models, a data engineer to feed them, an MLOps or platform engineer for reliability, a dedicated evaluation owner who decides whether output is correct, and a product owner who can write a precise spec. An ML engineer joins when the model itself is your differentiator. Most teams do not need all of these on day one, they need the first two or three, more senior than feels comfortable.

How does AI change the structure of a team? It makes the team smaller and more senior, and it promotes evaluation from a shared chore to a named role. When generation is cheap, the bottleneck moves from producing the work to judging whether the work is right, so the org flattens, each person owns more surface area, and the senior-to-junior ratio tilts hard toward senior.

How big should an AI team be? Two to three people before product-market fit, five to ten in growth, and structure rather than raw headcount at scale. Set the size by your constraint, data quality, reliability, or evaluation, not by your ambition. A smaller senior team usually ships faster than a larger junior one.

Should I hire an ML engineer or an AI application engineer first? If you are adding AI to an existing product, hire the application engineer first, they turn a model into a feature users can trust. Hire an ML engineer first only when the model itself is the product. If you cannot fully evaluate either candidate yourself, buy the judgment pre-vetted rather than gambling on a long search, which is the work my team at Devlyn does.

When to Hire an AI Engineer (and When to Wait)

Alpesh Nakrani — Mon, 27 Apr 2026 18:30:00 GMT

When to hire an AI engineer: the signals that mean it is time for your first AI hire, the signals that mean wait, and what hiring too early actually costs.

The honest answer to when to hire an AI engineer is this: hire one when AI work has become recurring, core to the product, and is breaking in production in ways your current team cannot diagnose. Wait when the AI is still a one-off experiment, when you are pre-product-market-fit, or when you cannot yet evaluate the candidate you would be hiring. Most teams I talk to are on the wrong side of that line in one direction or the other. They either hire a senior specialist to babysit a feature that an API call would have handled, or they keep duct-taping a production-critical system together with no one who owns whether the model is actually right.

I have made this hire more than 80 times at Devlyn, and I sit in two seats while I do it: I read the model traces and I read the P&L. That combination is why I am allergic to the standard hiring advice, which treats the first AI hire as a sourcing problem you solve as early as possible. It is not. The first AI hire is a timing decision, and timing it wrong is one of the more expensive mistakes a founder can make in 2026, because the market is the tightest it has ever been and a wrong senior hire does not show up as a problem until it has already cost you six months.

This piece is the timing decision, laid out from both seats. It is part of my broader guide to hiring AI engineers, which covers what the role is and how to vet it; here I am only answering the question of whether it is time at all. I will give you the signals that mean go, the signals that mean wait, a table you can run against your own situation today, what to do before the hire, how to structure the first one, and what hiring too early actually costs.

Hire when AI is recurring, core, and failing in production. The trigger is not "we want to use AI." It is "AI is now load-bearing and no one owns whether it is correct."
Wait when you are pre-PMF or the work is a one-off. APIs, no-code, and a scoped consultant are the correct tools for validation. A full-time hire is not.
The market is structurally short. Demand for AI engineers outruns supply by roughly three to one, and senior searches run four to six months. Do not start that clock before you need to.
If you cannot vet the hire, do not hire full-time yet. Buy pre-vetted judgment through staff augmentation or a fractional engagement until you can evaluate the role yourself.
Hiring too early is a six-figure mistake. A wrong senior AI hire can cost 1.5x to 3x salary all-in once you count recruiting, ramp, and recovery.

When to hire an AI engineer: the signals that mean go

The clearest signal that it is time to hire an AI engineer is that AI work has stopped being a project and become a surface. You are no longer asking "can we add a summarization feature." You are maintaining three or four AI-powered flows, each with its own failure modes, and every model provider change ripples through all of them. When the AI work is recurring and nobody's job description currently contains the words "own whether the model is right," that is the signal.

The second signal is that you have hit the API-wrapper ceiling. There is a real shift happening in 2026: the teams that built lightweight wrappers around model APIs are discovering that the wrapper was the easy 80% and the hard 20% is where the product actually lives. When off-the-shelf tools stop being sufficient, when you need custom retrieval over your own data, when generic prompts no longer clear your quality bar, you have crossed from configuration into engineering. That crossing is a hiring signal.

The third signal is the one I trust most, because it is the one that shows up in production rather than in a planning meeting. It looks like this: data does not flow cleanly between systems, deployment is improvised, monitoring is missing, and the thing that looked solid in testing starts failing in front of real users. That gap between the demo and production is exactly where a real AI engineer earns their cost, and it is invisible until you have enough traffic to expose it.

The trigger to hire is not "we want to use AI." It is "AI is now load-bearing and no one owns whether it is correct."

The fourth signal is that you have a data asset worth building on. If you own a stream of proprietary data, your support transcripts, your transaction history, your domain-specific documents, then there is durable work for an AI engineer to do that a generic API cannot. A data moat turns AI from a feature into a capability, and capabilities need owners. If your AI runs entirely on public models and public data, you have less to defend and less reason to staff up internally.

The fifth signal is sustained, real usage. One external rule of thumb I have seen, and broadly agree with, is roughly a thousand monthly active users held for six months before your first dedicated hire. The specific number matters less than the principle: you hire to scale and harden something that works, not to discover whether it works. Usage that has survived contact with real customers is evidence that the thing is worth hardening.

When to wait on hiring an AI engineer (or outsource instead)

The strongest reason to wait is that you have not found product-market fit yet. Before PMF, your product is a hypothesis, and a full-time AI engineer is a very expensive way to refine a hypothesis. The advice I give founders here is blunt: get the product working and generating revenue first, then hire to scale it. Some teams set a concrete bar, bootstrapping to fifty or a hundred thousand in monthly recurring revenue before the first engineer, on the logic that you only need that help once you know the product works.

The second reason to wait is that the AI work is genuinely a one-off. If you need to classify a backlog of documents once, or stand up a single internal chatbot, that is an API call and an afternoon, not a headcount. The mistake is converting a bounded task into an open-ended salary. A one-off does not generate the recurring, compounding work that justifies a full-time owner.

The third reason to wait is that you have no data and no path to it. An AI engineer with nothing proprietary to build on will spend their time wiring up the same public APIs you could have configured yourself, at five times the cost. If your data is messy, ungoverned, or simply does not exist yet, the most valuable work is cleaning and organizing it, which is often not an AI engineering job at all.

The fourth reason to wait is the one founders hate to hear: you cannot yet evaluate the person you would be hiring, and the market is full of people who talk fluently about models but cannot ship one that survives production. If no one on your team can tell a strong candidate from a confident one, hiring full-time is a gamble with four-to-six-month odds. In that situation you do not freeze; you buy pre-vetted judgment through a fractional or staff-augmentation arrangement until you understand the role well enough to hire for it yourself. I have written more on the build-versus-buy choice in my piece on in-house versus outsourced AI.

A signal-by-signal table you can run today

Here is the decision compressed into something you can hold against your own situation in five minutes. Run each row honestly. If most of your reality lands in the "ready" column, start the search. If it lands in "wait," you have cheaper, faster options that are also the correct ones.

Signal	Ready to hire?	What to do
AI work is recurring and core to the product	Yes	Start the search for a first hire
You have hit the limits of off-the-shelf APIs	Yes	Hire an application engineer to own the custom layer
Production fails in ways no one can diagnose	Yes	Hire now; this is costing you customers
You own proprietary data worth building on	Yes	Hire; the moat justifies an owner
Sustained real usage for six-plus months	Yes	Hire to harden and scale what works
Pre-product-market-fit, still validating	Wait	Use APIs and a scoped consultant; find fit first
The AI task is a bounded one-off	Wait	Solve with an API call, not a headcount
No proprietary data or path to it	Wait	Fix data and governance before staffing AI
You cannot vet an AI engineer yourself	Not full-time yet	Use staff aug or fractional until you can

What to do before your first AI hire

Almost every team that needs an AI engineer eventually has a productive window before that where the right move is APIs plus discipline, not headcount. Build the first version on hosted model APIs. They are good enough to find out whether the feature matters, and they let you move in days instead of the months a hire will take. The goal of this stage is learning, not architecture.

The single most valuable thing you can do in this window is build evaluations from day one. An eval suite, a held-out set of representative inputs labeled with the outputs you want, is what tells you whether the model is actually good enough and, later, whether a candidate's work is. Evals are the AI-native equivalent of tests, and they are the artifact that makes the eventual hire productive in week one instead of week ten. I go deeper on why in evals that predict production.

If the strategy itself is unclear, which use case to fund first, whether your data is ready, what the roadmap should be, that is a scoped advisory engagement, not a full-time hire. We run exactly this kind of work through Devlyn's AI strategy and readiness service: prioritize the use cases, assess data readiness, and produce a roadmap you can actually execute, before you commit a salary to it. The point of this stage is to convert a vague ambition into a defined job, so that when you do hire, you are hiring for a failure you can name rather than a hope you cannot.

Full-time vs fractional vs staff augmentation for the first hire

Once the signals say go, the next decision is the engagement model, and it hinges almost entirely on one question: can you vet the hire yourself? The market context matters here. Open AI roles outnumber qualified candidates by more than three to one, AI skills are now the hardest in the world to hire for, and a senior search commonly runs four to six months, as Pin's 2026 AI compensation benchmarks document in detail. That clock is the hidden cost of defaulting to full-time.

A full-time hire is right when AI is a permanent, central capability and you have someone who can evaluate the candidate and manage the work. You are buying long-term ownership and deep context, and you are accepting a four-to-six-month search and a compensation package that, for a US generative-AI specialist, sits well above a median AI engineer salary of around a hundred and seventy thousand dollars. For more on what that actually runs, see my breakdown of AI engineer cost.

A fractional engineer is right when the work is real and recurring but does not yet fill a full week, or when you want senior judgment guiding a junior team without paying for a full senior seat. Staff augmentation is right when you need to move now and cannot afford the search, or when you cannot vet a permanent hire and want the vetting done for you. Pre-vetted augmentation can put a senior engineer on your problem in two to four weeks instead of four to six months, which is the difference between catching a production problem and explaining it to a board.

This is the path we built Devlyn's AI application engineering team for: senior engineers, pre-vetted, who own a production feature end to end, with a short risk-free trial so you can see the work before you commit. It is the answer to "we need this now and cannot afford to gamble a six-month search on a candidate we are not equipped to evaluate." Use it as a bridge to a confident full-time hire, or as the engagement itself.

Sequencing the first one to three hires

When you do build a team, sequence it; do not hire the org chart you will need in two years. "AI engineer" is not one role, it splits into application, retrieval, MLOps, and various specialists, and hiring the wrong specialization first wastes months. The most common sequencing mistake is leading with a research-flavored ML engineer when what the product actually needs is someone who ships features on top of existing models.

Your first hire is almost always an AI application engineer: the person who takes models someone else built and turns them into reliable, evaluated product features. They give you the broadest immediate coverage because most early AI work is integration, prompt and retrieval design, evaluation, and production hardening, not model training. Hire for the judgment to know when an output is wrong and own the fix, which is the throughline of my whole hiring guide.

The second hire is driven by where the first one is drowning: a retrieval or RAG specialist if retrieval quality over your own data is the bottleneck, or an MLOps hire if the system is brittle in deployment, monitoring, and cost. The third hire is a genuine specialist, fine-tuning, agents, multimodal, but only once a specific, recurring need has proven itself. Each hire should answer a failure the previous configuration could not, not fill a box on an aspirational chart.

Hire toward the failure you cannot tolerate, not the org chart you imagine you will need in two years.

The cost of hiring too early

I want to put real numbers on the downside, because "wait" sounds like caution and "hire" sounds like momentum, and that framing gets founders into trouble. A wrong senior technical hire costs roughly 1.5x to 3x annual salary once you count recruiting fees, ramp time, lost productivity, and the eventual replacement. For an AI role north of two hundred thousand dollars, that is a three-to-six-hundred-thousand-dollar mistake, and it does not announce itself for months.

The failure is rarely that the engineer was incompetent; it is misalignment, a real person hired against an unreal need, the classic example being a startup that recruits a research-flavored ML hire when the product needed a product engineer, a mismatch guides to AI startup hiring flag as a leading cause of failed early hires. I have seen the reported case of a fintech startup that spent twenty-two thousand dollars a month for six months on an AI assistant that, in the end, nobody used. That is not a story about a bad engineer. It is a story about hiring before the problem was defined, which is the most expensive way to discover that the problem was not yet worth solving.

The macro picture says the same thing. Industry analysts have found that roughly seventy percent of AI projects fail to reach production, and the dominant causes are misaligned expectations, unclear role definition, and skill mismatch, not raw technical difficulty. Hiring too early loads the dice toward exactly those failure modes, because you are committing a permanent, scarce, expensive resource to a target that has not stopped moving.

This is the both-seats argument in one line: the engineering case for waiting is that you cannot harden what you have not validated, and the revenue case for waiting is that a premature senior hire is a six-figure bet placed before you know the odds. When the signals genuinely point to go, hire with conviction. Until then, the disciplined move, APIs, evals, a scoped advisory engagement, or pre-vetted augmentation, is not the timid choice. It is the correct one.

Frequently asked questions

Do I need an AI engineer, or can I just use an API?

If your AI work is a bounded, occasional task and generic models clear your quality bar, an API is the right tool and a full-time hire is not. You need a dedicated AI engineer when the work becomes recurring and core, when you have hit the limits of off-the-shelf tools, or when the system fails in production in ways nobody on your team can diagnose. The API is for validation; the engineer is for hardening something that already matters.

What is the right stage for a first AI hire?

After you have product-market fit and sustained, real usage, not before. A common rule of thumb is roughly a thousand monthly active users held for six months, and some founders wait until fifty to a hundred thousand in monthly recurring revenue. The principle behind the numbers is that you hire to scale and harden something that works, not to discover whether it works.

Should my first AI hire be full-time, fractional, or staff augmentation?

It depends on whether you can vet the hire yourself and how fast you need to move. Full-time fits a permanent, central capability when you can evaluate the candidate and absorb a four-to-six-month search. Fractional fits real but part-time work or senior oversight of a junior team, and staff augmentation fits when you need pre-vetted judgment in weeks rather than months because you cannot yet vet a permanent hire.

How much does waiting too long cost me?

Waiting has a real cost too: when AI is load-bearing and unowned, every production failure is borne by customers and patched by people who do not fully understand the system. The signal that you have waited too long is that you are losing customers or burning senior time firefighting model behavior nobody owns. If that is happening, the timing question is already answered, and pre-vetted augmentation is the fastest way to stop the bleeding while you build toward a permanent hire.

If you are at the point where the signals say go and you want senior engineers who own a production AI feature end to end without a six-month search, Devlyn's AI application engineering team works on exactly this, and you can start with a short trial before you commit. For the full picture on what good looks like and how to vet it, start from the hiring AI engineers guide or read Building an AI-Native Team.

AI Engineer Red Flags: How to Spot a Bad Hire

Alpesh Nakrani — Sun, 26 Apr 2026 18:30:00 GMT

The AI engineer red flags that predict a bad hire: no evals, a demo that never shipped, a resume of buzzwords. Here is how to surface each one before you sign.

The AI engineer red flags that actually predict a bad hire are not the ones recruiters screen for. The strongest single signal is a candidate who has never run an eval and cannot tell you how they knew their system worked. Close behind it: a portfolio that is all demo and no production, a resume that lists every framework and explains none of them, and an inability to walk you honestly through a failure. None of those show up in a keyword match, yet all of them surface in twenty minutes if you ask the right question.

I sit in two seats: I hire AI engineers, and I deploy them on live customer-facing systems, so I see both the interview and the eighteen months after it. That second seat is the one most hiring guides are missing. A candidate who interviews beautifully and then cannot ship a system that survives p95 latency and a real cost budget is not a near-miss; they are the most expensive hire you will make, because you pay the salary, the opportunity cost, and the cleanup. This piece is the negative space around my hiring AI engineers guide: not what good looks like, but the warning signs that tell you to stop.

A resume is a list of things a person has stood near. The red flags are the gap between standing near a thing and being able to make it work when it breaks.

Key takeaways

If you read nothing else, these are the load-bearing red flags when hiring AI engineers:

No evals is the highest-signal red flag. An engineer who cannot describe how they measured a system's quality has been shipping on vibes, and vibes do not survive production.
Keyword density is not competence. A resume thick with RAG, agents, and vector databases tells you what they have read, not what they can build. Make them explain a tradeoff.
Demos lie; production does not. A candidate who has only ever shipped notebooks has never met latency, cost-per-task, or a 3 a.m. rollback.
The failure story is the interview. An engineer who blames the model, the data, or the deadline for every failure will blame them on your payroll too.
Balance the flags with green flags. A quiet portfolio with an honest test set beats a loud one with none. Do not reject the right person for bad packaging.

Resume and keyword red flags: when the buzzwords are doing the talking

The first red flag when hiring AI engineers shows up before the interview, on the resume itself. You will see a skills section that reads like a glossary: RAG, agentic workflows, vector databases, prompt chaining, fine-tuning, LangChain, four model providers, two orchestration frameworks. It looks like depth. It is usually breadth pretending to be depth, because the market has taught candidates that the keyword is the qualification.

This is not a fringe concern. ManpowerGroup's 2026 survey of 39,000 employers found that AI model and application development is now the single hardest capability to hire for worldwide, harder than traditional engineering (ManpowerGroup). When a skill is that scarce and that hyped, resumes inflate to match the demand. The keyword soup is a rational response to a hot market, which is exactly why it carries no signal.

The way to surface it is to pick one keyword and go three questions deep. If the resume says "RAG," ask how they chose the chunk size, then ask how they measured whether retrieval was actually returning the right passages, then ask what they did when recall was bad. A real practitioner has scars on every layer of that answer. A keyword candidate gets vague by the second question, because they have read the architecture diagram but never had to make it work on messy data.

GitHub stars and the length of the framework list belong in the same bucket. A repo with three hundred stars and no test set is a marketing artifact. A quiet repo with a frozen evaluation set and an honest failure log is the actual hiring signal, which is the same thing my piece on how to vet AI engineers argues you are really buying. Read past the polish to the part that shows judgment.

The biggest red flag: an AI engineer who has never run an eval

If I could keep only one screen, this would be it. Ask the candidate to describe a system they built and then ask the follow-up that decides the interview: how did you know it was working? An engineer who has shipped real AI answers immediately and concretely, talking about a held-out test set, an accuracy or faithfulness number, a threshold they had to clear, a failure mode they tracked. An engineer who has been shipping on vibes goes quiet, or worse, says "it looked good in testing."

This matters because AI systems fail silently. A traditional bug throws an exception; a bad AI output is fluent, confident, and wrong, and it sails straight past anyone who is only eyeballing the demo. The discipline that catches it is evaluation, and a candidate who has never built an eval harness has no instrument for the exact failure mode that will hurt you most. I have written the full version of this argument in my work on the skills that actually separate AI engineers, and evaluation judgment is the one at the top.

Here is what a missing-evals red flag sounded like in a real loop, with the details changed to stay NDA-safe. A senior candidate could draw a clean retrieval pipeline on the whiteboard and name every component. When I asked how he measured retrieval quality, he said the team "spot-checked a few queries each week," and there was no frozen set, no recall number, no record of what had regressed. He was not lying; he genuinely did not know that the thing he had built was unmeasured, which meant he had no way to tell a good change from a bad one.

# The question that surfaces the no-evals red flag

You: "How did you know the system was working?"

# Green-flag answer

Them: "Frozen set of 300 real queries. Faithfulness 0.91,

human-disagreement under 8%. Anything below blocked the deploy."

# Red-flag answer

Them: "It looked good in testing. We kept an eye on it."

If you want the language to run this part of the loop cleanly, the eval and judgment section of my AI engineer interview questions gives you the exact prompts and what a strong answer sounds like.

The "can't explain a failure" red flag

Ask any AI engineer to walk you through a time their system failed in production. The answer tells you more than the rest of the interview combined. You are listening for a specific shape: a clear description of what broke, an honest account of why they did not catch it sooner, and a concrete change they made so it would not happen again. That shape is the signature of someone who has actually owned a system through its bad days.

The red flag is the candidate who cannot produce a real failure, or who produces one and blames it entirely on something outside their control. "The model just wasn't good enough." "The data was messy." "The deadline was unrealistic." Each of those can be true, and a strong engineer will name them as factors, but they will still tell you what they owned in the failure. The candidate who owns nothing has either never been close enough to a production system to feel a failure, or has a habit of externalizing blame that will not improve once they are on your team.

I weight this heavily because AI work is failure-dense by nature. You are building probabilistic systems on shifting model behavior and imperfect data; things break constantly, and the job is largely about catching and containing those breaks. An engineer with no honest failure narrative is telling you they have not done the part of the job that actually is the job.

The demo-not-production red flag

This is the one my second seat sees most clearly. A candidate shows up with an impressive demo: a polished chat interface, a slick retrieval app, an agent that books a meeting on stage. It works, and it is genuinely good work. And it tells you almost nothing about whether they can build the thing you actually need, because a demo and a production system are different disciplines that happen to share a vocabulary.

A demo runs once, for one user, on clean input, with no budget. A production system runs ten thousand times an hour, on adversarial input, against a latency target and a cost ceiling, with a human on call when it breaks. The skills that make a great demo, fast iteration and a good eye for the happy path, are not the skills that keep a system alive at p95 latency on a real bill. I made this case at length in my argument for shipping smaller models, where the entire point is that the model that wins the demo rarely wins the margin.

Surface it by asking production questions about the demo: what was your p95 latency, what did each call cost, and how did you handle the request the system got wrong in front of a user? A candidate who has lived in production has crisp answers and probably a war story. A demo-only candidate treats these as someone else's problem, which is the tell, because in the job you are hiring for, they are the engineer's problem.

The illustrative version: a candidate's take-home agent worked flawlessly in a notebook and fell over the moment we pointed two hundred concurrent requests at it, because every step made a fresh model call with no batching, no caching, and no timeout. The logic was correct. The engineering for production simply was not there, and he had never been forced to notice because a notebook never asked him to.

Communication and ownership red flags

Some of the most expensive red flags are not technical at all. The first is the engineer who cannot scope. You describe a fuzzy problem and ask how they would approach it, and instead of narrowing it into something shippable, they either freeze or sprint straight to the most complex possible architecture. Scoping is the daily work of an AI engineer, because almost every real task arrives vague, and an engineer who cannot turn vague into a first shippable slice will stall every project they touch.

The second is the engineer who cannot explain their work to a non-engineer. In an AI-native product, the model's behavior is a business decision, not just a technical one. When the system makes a call a customer does not trust, someone has to be able to say why in plain language. A candidate who can only describe their system in jargon will not be able to do that, and you will discover it at the worst possible moment, in front of a customer or a board.

The third, and the one I treat as nearly disqualifying, is the candidate who is certain about everything. AI engineering is a field where the honest answer is frequently "it depends, and here is how I would find out." A candidate who has a confident, definitive answer to every question, including the genuinely ambiguous ones, is showing you they cannot tell the difference between what they know and what they are guessing. That is the exact failure mode that ships a confidently wrong system.

Interview behaviors that should make you stop

Beyond the answers, the behaviors in the room carry signal. Watch for the candidate who name-drops models and papers but cannot connect any of them to a decision they made. Reciting the frontier is easy; choosing between two options under a real constraint is the job, and the gap between the two is where weak candidates live.

Watch, too, for how they handle being wrong: push gently on one of their answers and see what happens. A strong engineer engages, reconsiders, and either defends the position with better reasoning or updates it cleanly. A red-flag candidate gets defensive, doubles down, or quietly abandons the point without acknowledging the change. You are not testing whether they are right; you are testing whether they can think in front of you when the ground moves, which is what every hard production day asks of them.

One more, and it is subtle: the candidate who has clearly used an AI assistant to pre-bake answers and cannot go off-script. The tooling is fine, expected even, but if every answer is polished and none of them survives a follow-up, you are interviewing the assistant, not the engineer. The fix is the same as everywhere else in this piece: ask the second and third question, the ones no canned answer covers.

The red-flag table you can run your loop from

Here is the full set in one place: each red flag, why it costs you in production, and the question or check that surfaces it. Run these in the same order on every candidate and you have a structured screen instead of a vibe.

Red flag	Why it matters in production	How to surface it
Never run an eval	No instrument for silent, confident-wrong failures	"How did you know it was working?" Listen for a frozen set and a number.
Keyword-stuffed resume	Breadth posing as depth; no real tradeoff experience	Pick one keyword, go three questions deep on it.
Demo-only portfolio	Never met p95 latency, cost-per-task, or a rollback	Ask the demo's latency, per-call cost, and failure handling.
No honest failure story	Has not owned a system through its bad days	"Walk me through a production failure you owned."
Blames the model or data	Externalizes blame; will not improve on your team	Probe what they owned vs. what they blamed.
Cannot scope a fuzzy task	Stalls every vague project, which is most of them	Give a fuzzy problem; watch for a shippable first slice.
Certain about everything	Cannot separate knowledge from guessing	Ask a genuinely ambiguous question; listen for "it depends."
Cannot explain to a non-engineer	Model behavior is a business decision someone must defend	"Explain your system to a customer who does not trust it."

Green flags: what to weight up so you don't reject the right person

Red flags are only half a hiring instrument. Used alone they turn into a reason to reject everyone, and the cost of a false rejection in a market this tight is real. ManpowerGroup puts overall hiring difficulty at 72%, and CIO has reported a roughly 75% fail rate on basic AI skills assessments, partly because the assessments cannot tell different kinds of AI work apart (CIO). Screen too hard on the wrong signals and you will reject good engineers for bad packaging.

So weight these up. An engineer who says "I do not know, but here is how I would find out" is showing you judgment, not a gap. A quiet portfolio with one project that has a frozen test set and a documented failure beats a loud one with five demos and no measurement. A candidate who narrows your fuzzy problem into a small shippable slice in real time has just demonstrated the single most useful daily skill in the job.

And weight up honesty about limits. The engineer who tells you a model is not good enough for a task yet, and can say precisely how they would measure when it becomes good enough, is worth more than the one who promises everything works. The deeper version of how these signals compound across a whole team lives in my book The AI-Native Team. The short version: hire for judgment, discount the polish, and do not mistake confidence for competence in either direction.

If you would rather not run this gauntlet yourself, every one of these red flags is screened out before you meet a candidate through Devlyn's pre-vetted AI application engineers, who have already cleared a harder version of this loop on live production work.

Frequently asked questions

What is the single biggest red flag when hiring an AI engineer?

An engineer who cannot tell you how they measured whether their system worked. If the answer to "how did you know it was working" is "it looked good in testing," they have been shipping on vibes, and vibes do not survive production. A real practitioner names a frozen test set, a metric, and a threshold without hesitating.

How do I spot a fake AI engineer from their resume?

Look for keyword density with no depth behind it: a long list of RAG, agents, vector databases, and frameworks, with no explanation of a tradeoff they made. Pick one keyword and go three questions deep in the interview. A real engineer has scars on every layer; a keyword candidate gets vague by the second follow-up.

Are GitHub stars a good way to judge an AI engineer?

No. Stars measure marketing reach, not engineering judgment. A repo with hundreds of stars and no test set is a demo; a quiet repo with a frozen evaluation set and an honest failure log is the real signal. Read past the polish to the part that shows how they knew it worked.

What interview questions reveal AI engineer red flags fastest?

Three do most of the work: "how did you know it was working" surfaces the no-evals flag, "walk me through a production failure you owned" surfaces the ownership flag, and asking the latency and cost of their demo surfaces the demo-only flag. The full set lives in my guide to AI engineer interview questions.

The bad AI hire is rarely the one who fails the technical screen. It is the one who passes the demo, talks fluently, and cannot tell a correct output from a confidently wrong one once the system is live. Screen for that, balance it with the green flags, and you will avoid the most expensive mistake in AI hiring. If you would rather skip the loop entirely, the engineers placed through Devlyn's AI application engineer hiring have already been screened against every red flag here, and the fuller picture sits in the hiring AI engineers guide.

AI Hiring Mistakes That Cost the Most (and the Fixes)

Alpesh Nakrani — Sat, 25 Apr 2026 18:30:00 GMT

The most expensive AI hiring mistakes are not bad luck. They are predictable: hiring for hype, never testing evaluation skill, and the wrong role for your stage.

The most common and costly AI hiring mistakes are not subtle, and they repeat with almost boring regularity: hiring for hype and keywords instead of evidence, never testing the one skill that actually matters in production, hiring the wrong role for your company's stage, over-indexing on academic ML credentials, and ignoring the ownership and communication that an AI-native team lives or dies on. Underneath all of them sits a sixth mistake that quietly amplifies the rest: a slow, broken interview process that loses the few people who would have been right. None of these are exotic. They are the defaults you fall into when you treat an AI hire like a normal software hire.

I am an engineer who became a CRO, which means I have made these mistakes from one seat and watched them get made from the other. I have hired AI engineers, deployed them on real products, and I have also sat across the table selling AI work to companies that got the hire wrong and were paying for it. The pattern is the same every time: the mistake is rarely visible at the offer stage. It surfaces three months later, in production, when something a model said is wrong and nobody on the team can tell you why, or whether it matters.

This article is the honest version, written from both seats. For each mistake I will tell you what it looks like, what it tends to cost, and the specific fix. If you want the full hiring framework that sits above all of this, start with my guide to hiring AI engineers; this piece is about the ways that hiring goes wrong and how to avoid paying for them twice.

Key takeaway: AI hiring mistakes are predictable, not random. The same five or six failure modes account for most bad outcomes, which means they are preventable with a deliberate process.
Hype is not evidence. A resume listing every model and framework tells you what someone has read, not what they can ship or evaluate under production pressure.
Test evaluation, not recall. The single skill that predicts a good AI hire is the ability to look at model output and know whether it is correct, and most loops never test it.
Stage determines role. A research scientist on a five-person product team is a mistake even if the person is brilliant; you needed an application engineer who ships.
A bad hire is expensive, and a slow process is too. The cost of getting it wrong runs to a third of first-year salary or more, and a sluggish loop loses the candidates you most wanted before you ever make an offer.

Mistake one: hiring for hype and keywords instead of evidence

The most common AI hiring mistake is the easiest to fall into, because it feels like diligence. You read a resume that lists every model family, every orchestration framework, every vector database, fine-tuning, RAG, agents, evals, the whole vocabulary, and you think you are looking at a strong candidate. What you are actually looking at is a list of things the person has heard of. The vocabulary is free; anyone who reads the same blog posts you do can assemble that list in an afternoon.

Hiring for keywords filters for people fluent in the discourse, not people good at the work. Those are different populations, and in a field moving this fast the overlap is smaller than you would hope. Framework names on a resume have a shelf life measured in months; the judgment to know which framework is the wrong tool for your problem does not show up as a keyword at all.

The fix is to make every claim earn its place with evidence. For each significant item on the resume, ask the person to walk you through one real thing they built with it: what the problem was, what they tried first, why it failed, and what they changed. People who did the work answer in specifics and contradictions, because real systems are full of both; people who pattern-matched the keyword answer in generalities and brochure language. You will know inside two questions which one you have.

Mistake two: never testing the one skill that matters, which is evaluation

Here is the AI recruitment mistake that costs the most and gets tested the least. The single most predictive skill for a production AI engineer is evaluation: the ability to look at a model's output and know, quickly and correctly, whether it is right, why it is wrong when it is wrong, and whether the failure mode is the kind you can ship around or the kind that ends a customer relationship. Almost no interview loop tests this directly. They test coding, system design, whether the candidate can explain attention, none of which tells you whether the person can catch a confident, plausible, wrong answer before it reaches a user.

This matters because the failure mode of modern AI is not a crash. It is a fluent, confident, completely wrong output that looks exactly like a correct one. A team that cannot tell the difference does not know they are shipping broken work until a customer tells them, and by then the trust is already spent. The skill that prevents this is judgment under ambiguity, and it is invisible on a resume and absent from most interview rubrics.

The failure mode of modern AI is not a crash. It is a fluent, confident, wrong answer that looks exactly like a correct one.

The fix is to put real, messy model output in front of the candidate and ask them to evaluate it rather than produce it. Show them a generated answer with a subtle error buried in it and ask what is wrong, what they would need to know before shipping it, and how they would build a check that catches this class of error automatically. Strong candidates dig into the output and reason about failure modes; weaker ones immediately pivot to how they would have generated something better, which tells you they are wired for throughput, not for the evaluation work that keeps an AI product safe. If you want the full interview design for this, I have written it up in detail in how to vet AI engineers.

Mistake three: hiring the wrong role for your company's stage

A surprising amount of AI hiring pain comes from hiring a genuinely excellent person for the wrong job. The classic version is a five-person product team that hires a research scientist with a strong publication record to build a customer-facing feature. The person is brilliant, and also the wrong hire, because at that stage you did not need someone who can advance the state of the art. You needed someone who can take a good-enough model, wire it into a product, instrument it, and ship it to real users this quarter.

The mirror-image mistake also happens: a mature org with a real research agenda hires a fast application engineer and then wonders why nobody is pushing the modeling frontier. Neither person failed. The role was scoped to the wrong stage of the company. This is one of the more expensive mistakes because it can look like success for months, the person is busy and productive, before anyone notices the work being produced is not the work the business needed.

The fix is to decide what stage you are at before you write the job description, and to be honest about it. Most companies shipping AI features need application engineers who can ship and evaluate, not researchers who can publish. If you are not sure which one you need, that uncertainty is itself a signal worth resolving before you open the role; my piece on when to hire an AI engineer walks through how to tell. Match the role to the constraint you actually have, not to the most impressive person you can imagine hiring.

Mistake four: over-indexing on academic ML over applied judgment

This one is close to the stage mistake but worth separating, because it operates as a bias even when the role is scoped correctly. There is a deep-seated instinct, especially among non-technical hiring managers, to treat a PhD, a strong publication list, or a famous lab on the resume as the top signal for an AI hire. For most production AI work, it is one of the weaker signals you can lead with.

Academic ML and applied AI engineering are related but genuinely different disciplines. Research rewards novelty, theoretical depth, and pushing a benchmark a fraction of a point. Production rewards reliability, evaluation, cost control, and the judgment to know when good enough is good enough. A brilliant researcher can absolutely make a brilliant applied engineer, but the credential does not predict it, and treating it as the headline signal will cause you to pass over the people who are best at the actual job.

The fix is to weight shipped, instrumented, in-production systems above papers and pedigree when the role is an applied one. Ask what they put in front of real users, how they knew it was working, what broke, and what they did about it. The honest answer to "what is the worst thing your model did in production and how did you catch it" tells you far more than a citation count. The cost of getting this wrong compounds, which is part of why the cost of an AI engineer is worth modeling against the value they actually create, not the prestige they carry.

Mistake five: ignoring ownership and communication

The quietest expensive mistake is hiring a strong technical contributor who cannot or will not own an outcome and cannot communicate clearly about uncertainty. On a traditional team this is survivable; you route around it with process. On an AI-native team it is corrosive, because so much of the work is judgment that has to be communicated, defended, and owned rather than handed off.

When a model's behavior is ambiguous, somebody has to decide whether it is good enough to ship and stand behind that decision. When an evaluation result is borderline, somebody has to communicate the trade-off honestly to people who cannot read the traces themselves. An engineer who hoards uncertainty, who says "the model does what it does" and shrugs, or who cannot explain a risk to a non-technical stakeholder, creates a fog that no amount of raw skill burns off. I have written about why this matters structurally in the hiring pillar: an AI-native team is built around judgment you can observe and outcomes someone owns, not throughput you can count.

The fix is to test communication and ownership as first-class criteria, not nice-to-haves. In the interview, give the candidate a genuinely ambiguous situation and watch whether they take a position and own it or hedge until you make the call for them, then ask them to explain a technical risk to you as if you were a non-technical executive. The people you want make the ambiguity smaller. The people you do not want make it your problem.

Mistake six: a broken, slow process loses the people you actually wanted

The last mistake is not about the candidate at all. It is about you. The best AI engineers are in a market where they have options, and a hiring loop that is slow, disorganized, or vague about what the role actually is will lose them before you ever extend an offer. The candidates least bothered by a sloppy seven-week process are, predictably, the ones with the fewest other options, which is an adverse selection problem you are creating with your own calendar.

There is a deeper version of this. The reasons software projects fail are well documented, and unclear requirements sit near the top; the Standish CHAOS data attributes roughly a quarter of project failures to fuzzy or shifting requirements, with only about thirty percent of IT projects succeeding cleanly. The same fog that sinks projects sinks hiring loops. If your team cannot articulate what the role is, what good output looks like, and what the first ninety days deliver, you will run an inconsistent process, send mixed signals, and make a worse decision at the end of it.

The fix is to treat the process as part of the product. Decide the role, the rubric, and the evaluation exercises before you talk to anyone, compress the loop to days rather than weeks, and give every candidate the same well-designed exercise so you are comparing like with like. A tight, respectful, well-scoped process is also your best recruiting tool: the strong candidates can tell from the inside whether you know what you are doing, and they are evaluating you exactly as hard as you are evaluating them.

The mistakes, the costs, and the fixes

Here is the whole pattern in one place. The costs are illustrative, drawn from what these mistakes tend to run in practice rather than from any single company's books, but the direction and rough magnitude are real. A bad hire alone is commonly estimated at around thirty percent of first-year earnings by the US Department of Labor, and for a specialized AI role that floor is generous.

Mistake	What it tends to cost	The fix
Hiring for hype and keywords	A confident hire who cannot ship; months lost before it shows	Make every resume claim earn its place with a specific, real example
Never testing evaluation skill	Broken output reaches customers; trust spent before you notice	Put messy model output in front of them and ask them to evaluate it
Wrong role for your stage	A brilliant person doing work the business did not need	Scope the role to your actual stage before writing the job description
Over-indexing on academic ML	Passing over the best applied engineers for the most credentialed	Weight shipped, instrumented production systems above papers
Ignoring ownership and communication	A fog of unowned uncertainty that no raw skill burns off	Test ownership and clear risk communication as first-class criteria
A slow, broken process	Losing the candidates you most wanted before the offer	Tight, consistent, well-scoped loop measured in days, not weeks

A few illustrative numbers make the table concrete. Consider a $180k senior AI engineer hired on hype: the DOL's thirty-percent floor puts the direct cost of getting it wrong around $54k, and that is before the months of misdirected work, the production incident a missing evaluator lets through, and the recruiting cost of doing it all again. In practice, for a specialized AI role I would model the all-in cost of a bad hire closer to a full year of salary once you count lost momentum and the opportunity cost of the right person you did not hire.

A second illustration, on the process side. Picture two companies chasing the same shortlist of strong AI engineers: one runs a focused five-day loop built around an evaluation exercise, the other a six-week loop with vague scope and four redundant rounds. The first company makes offers while the candidates are still interested; the second sends its offers into a void because the people it wanted accepted somewhere else two weeks earlier. Same talent pool, opposite outcomes, and the only variable that differed was the process the companies controlled entirely.

If you are staring at a hire you are not confident you can run well, that is a legitimate reason to bring in people who do this every day. My team works on exactly this problem; you can hire vetted AI application engineers through Devlyn and skip the most expensive mistakes on this list entirely.

The both-seats summary: the mistake is almost always upstream of the hire

The thread running through every one of these mistakes is that the failure happens before the offer, in how you scoped the role, designed the loop, and decided what you were testing for. The candidate is downstream of all of it. When a hire goes wrong, the instinct is to blame the person, but in my experience the decision that doomed it was made weeks earlier, when someone wrote a job description full of keywords and a rubric full of nothing.

That is also the good news. Upstream mistakes are fixable, and cheaply relative to what they cost downstream. Deciding your stage, weighting evidence over hype, testing evaluation directly, and running a tight process are all things you control completely and can change before you talk to a single candidate. None of it requires you to be a machine learning expert; it requires you to be honest about what you need, disciplined about testing the skill that matters, and respectful enough of strong candidates to run a process worthy of them.

The deeper framework for building a team this way, around judgment you can observe rather than throughput you can count, is the subject of Building an AI-Native Team, which is where I would point you if you want the full system rather than the list of pitfalls.

Frequently asked questions

What is the most common AI hiring mistake?

Hiring for hype and keywords instead of evidence. A resume that lists every model, framework, and technique tells you what someone has read, not what they can ship or evaluate under production pressure. The fix is to make every claim earn its place: for each item, ask the candidate to walk you through one real thing they built with it, including what failed and what they changed. People who did the work answer in specifics; people who pattern-matched the vocabulary answer in generalities.

How much does a bad AI hire actually cost?

The US Department of Labor's commonly cited floor is about thirty percent of first-year earnings, which already lands in the tens of thousands for a senior role. For a specialized AI engineer, the real figure is usually higher once you count the misdirected work, any production incident a missing evaluator let through, the recruiting cost of doing it again, and the opportunity cost of the right person you did not hire. It is reasonable to model the all-in cost of a bad AI hire at close to a full year of that role's salary.

What is the one skill I should test that most interview loops skip?

Evaluation. The ability to look at a model's output and quickly tell whether it is correct, why it is wrong when it is wrong, and whether the failure is shippable or relationship-ending is the single most predictive skill for a production AI engineer, and almost no loop tests it directly. Put real, slightly-wrong model output in front of the candidate and ask them to evaluate it rather than produce something better. The ones who dig into the failure are the ones who will keep your AI product safe.

Should I prioritize a PhD or research background when hiring AI engineers?

For most production AI work, no, not as your headline signal. Academic ML rewards novelty and theoretical depth; applied AI engineering rewards reliability, evaluation, cost control, and shipping. A great researcher can become a great applied engineer, but the credential does not predict it. For an applied role, weight shipped, instrumented, in-production systems above papers and pedigree, and ask what they put in front of real users and how they knew it was working.

Building an AI Team: The Order You Actually Build It In

Alpesh Nakrani — Fri, 24 Apr 2026 18:30:00 GMT

Building an AI team is a sequencing problem, not a headcount problem. Here is the order I build them in, first hire to scaling, without the bloat.

Building an AI team starts with one decision that most people get wrong: who you hire first. The answer is not a researcher, not a data scientist, and not a cheap junior to "prototype something." Your first hire is a senior generalist who can own an outcome end to end, read model output and know whether it is correct, and ship something a real user touches. From there the sequence is narrow and ordered: prove value with that one person, add an evaluation owner, add domain depth, then add product surface. You do not staff to an org chart; you staff to the next bottleneck.

I am an engineer who turned into a CRO, which means I sit in two seats at once: I read the traces and I read the P&L. I have built AI teams from scratch at Devlyn and watched dozens of other companies try to do the same. The pattern in the failures is almost always the same, and it is not a talent problem; it is a sequencing problem. People hire the roles a mature AI org has before they have done the work that justifies any of those roles, and they end up with an expensive team that cannot ship.

This is the companion to my definitive guide to hiring AI engineers, which covers what good looks like and what it costs. This piece is about order: the sequence I would follow if I were standing up an AI team this quarter, and the mistakes I would refuse to repeat.

The first hire is judgment, not throughput. A senior generalist who can own an outcome beats two juniors and a researcher every time at the start.
Sequence to the bottleneck, not the org chart. Prove value with one person, then add an eval owner, then domain depth, then product surface. Each hire should unlock the next, not fill a box.
Start augmented unless you have a reason not to. A senior team you can stand up in days buys you the proof you need before you commit to permanent headcount.
Evals and ownership are the team, not a process you add later. The thing that makes a small AI team fast is confident evaluation, and that is cultural before it is technical.
Scaling means raising judgment density, not headcount. The teams that stay fast keep the senior-to-junior ratio high and resist the urge to grow for its own sake.

Your first hire is a senior generalist who can own an outcome

The single most consequential decision in building an AI team is the first hire, because that person sets the floor for everything after. I want someone who can take an ambiguous business problem, decide whether AI is even the right tool, build a thin version that touches a real user, and tell me honestly whether it worked. That is a generalist with deep judgment, not a specialist with a narrow tool.

The temptation is to hire for the resume keyword. You think you need an AI team, so you go find someone whose title says "machine learning" and whose LinkedIn lists the frameworks. That is the wrong filter, because the frameworks are learnable in a week. What you cannot teach quickly is the calibration to look at a plausible-looking output and know it is wrong, and the discipline to own the outcome rather than the artifact.

I learned this the expensive way, watching a company hire a brilliant researcher as their first AI person. He could explain attention mechanisms beautifully, but he had never shipped anything a customer used, and when the first prototype produced confident nonsense, he treated it as an interesting research finding rather than a fire. Six months in they had papers' worth of analysis and nothing in production. The first hire has to be someone whose instinct, when the model is wrong, is to fix the product, not to study the failure.

This is also why I run a senior-only posture at the start. The gap between a plausible wrong answer and a correct one is invisible without deep expertise, and a junior cannot see it yet. I go deeper on that tradeoff in senior versus junior AI engineers, but the short version is that hiring people who cannot see the gap does not reduce your risk. It buries it where you will find it in front of a customer.

The first hire has to be someone whose instinct, when the model is wrong, is to fix the product, not to study the failure.

Sequencing the next three roles after the first

Once your first hire has shipped something real and you have evidence that AI moves a number that matters, you add the next roles in order of bottleneck. The order is not arbitrary. Each hire should exist to remove the specific constraint that is now slowing the person before them.

The second hire is almost always an evaluation owner. Once you are shipping AI into production, your bottleneck stops being generation and becomes confident evaluation: knowing whether the output is good enough to ship, and catching regressions before users do. This person builds and owns the eval suite, defines failure modes, and turns "it seems fine" into a measurable gate. They are the reason your first engineer can move fast instead of looping on every change.

The third hire is domain depth: someone who knows your specific problem space cold, whether that is healthcare coding, financial compliance, or retail merchandising. A generalist plus an evaluator can build a competent system, but they will miss the domain-specific failure that only an expert sees coming. The domain hire makes the evals smarter and the product correct in ways a generalist never could.

The fourth hire is product surface: the person who owns how the AI capability shows up to the user, the UX of trust, the explanation, the fallback. By this point you have something that works; this hire makes it something people want to use. If you are wondering whether you are even ready for the first of these, my piece on when to hire an AI engineer covers the signals that mean it is time and the ones that mean wait.

In-house, augmented, or both: how to start without guessing

The question I get most from founders is whether to build the team in-house or use an augmented team. The honest answer is that at the start, you almost certainly want augmented, and not for the reason people assume. It is not mainly about cost. It is about reversibility.

When you are building an AI team from scratch, your single biggest risk is committing permanent headcount to a bet you have not validated. A full-time senior AI hire is a long, expensive search, a relocation conversation, an equity grant, and a person who is genuinely hard to unwind if the direction changes in three months. Building AI products in 2026 means the direction changes in three months. You want to learn fast and stay reversible while you do.

An augmented senior team lets you stand up real capability in days instead of the quarter a great in-house hire takes, prove or kill the bet, and only then decide what to make permanent. I have watched companies burn six months and a recruiting budget hiring a full team for an initiative that the first two weeks of real work would have told them to scope completely differently. Start augmented, learn, then hire in-house against the roles you now know you need.

This is not a permanent answer. The right end state for most companies is a small in-house core that owns the strategy and the most defensible work, supported by augmented depth for the rest. I think through that balance more fully in in-house versus outsourced AI. The mistake is treating it as a binary you decide once, rather than a ratio you tune as you learn.

If you want to start that way, standing up a senior team in days rather than a quarter is exactly the work we do at Devlyn. We can put senior application engineers on your problem in 48 hours, which is the fastest path I know to the evidence you need before you commit headcount.

The eval-and-ownership culture is the team, not a process bolt-on

Here is the thing nobody tells you when you are building an AI team: the team is not the org chart. The team is the culture around evaluation and ownership, and that culture is what makes a small group fast. You can hire the right roles and still be slow if the people in them are not wired to evaluate honestly and own outcomes.

The reason is structural. When generation is cheap, the bottleneck on speed is rarely producing the output; it is being confident the output is good enough to ship. A team that has strong evaluative capacity ships, because they know when to stop, while a team that does not loops endlessly, because nobody can say "this is good" with conviction, so everything gets re-litigated. The eval culture is the difference between a team that moves and one that thrashes.

I explored the deeper version of this in what a team is for after the machine does the work: when the machine does the producing, what you need humans to do is specify, evaluate, and own. So when I build a team, I hire for that and I measure for it. The internal shorthand I use is ownership over hours, outcomes over velocity. I do not care how busy someone looks; I care whether the outcome was good and whether they drove it.

Practically, that means the eval suite is not a thing one person maintains in a corner. Every engineer reads the evals, every engineer is expected to know the failure modes, and "it passed the evals" is a higher standard than "it works on my machine" ever was. The team that internalizes this early builds a compounding advantage, because their floor on quality keeps rising while their loop time keeps falling.

The team is not the org chart. The team is the culture around evaluation and ownership, and that culture is what makes a small group fast.

Tooling and process that scale a small team's output

A small AI team with the right tooling outperforms a large one without it, because the output per person is so much higher. But tooling for an AI team is not the same as tooling for a traditional software team, and the difference is where most people under-invest. You are not just shipping code. You are shipping a system whose behavior you cannot fully predict, which means observability and evaluation are first-class, not afterthoughts.

The non-negotiable tooling, in rough order of when you need it, looks like this. First, version control and CI for code, obviously, and second, an eval harness that runs on every change and gates deploys, the same way tests do. Third, production observability that lets you see what the model actually did with real inputs, not just what it did in your test set. Fourth, a fast path from a production failure back into the eval suite, so every real-world mistake becomes a permanent test.

// The loop that actually scales a small AI team production_failure -> add_to_eval_set // capture every real miss eval_set -> gate_on_every_deploy // never reship a known failure deploy -> observe_in_production // watch real inputs, not test ones observe -> production_failure // the loop closes; the floor rises

The process discipline that matters most is keeping this loop tight. Every time a real failure makes it into production and does not get captured as a test, you have spent the lesson and kept none of it. The teams that scale their output are the ones where the loop is so routine that nobody has to be reminded. If you want help standing up the observability and eval infrastructure rather than just the team, that is core to how Devlyn builds AI systems that hold up under real traffic.

Scaling without bloat: keep judgment density high

The most counterintuitive part of building an AI team is that scaling does not mean hiring. It means raising judgment density: the share of your team that can look at output and know whether it is right. When you add people who cannot do that, you do not scale capability. You add review burden, because now a senior person has to check the junior's work, and you have traded one bottleneck for another.

The math is genuinely different now. One senior engineer who can architect and evaluate is worth several production-oriented people in terms of output quality you can actually trust. So when the obvious move is to hire five more engineers to go faster, the better move is often to hire one more senior and give the existing team better tooling. You grow by increasing output per person, not by increasing the headcount you have to coordinate.

This keeps the team flat, which is itself an advantage. Fewer layers between the person setting intent and the output means less translation loss and faster decisions. I have seen this firsthand: a five-person senior AI team I worked with out-shipped a twenty-person team at a larger competitor, not because they were smarter, but because every one of the five could own an outcome and none of them was waiting on anyone else to evaluate their work.

The discipline is resisting the headcount reflex. Boards and leaders are conditioned to read growing headcount as growing capability, and in an AI-native team that proxy is broken. The honest signal is throughput of trusted outcomes per person, and that number goes up when you raise judgment density, not when you add bodies. For the role-by-role view of what that flat, senior-heavy shape looks like, I lay it out in AI team structure.

The build mistakes I see most often

I have watched enough teams get built to know where they go wrong, and the mistakes cluster. None of them are exotic. They are the predictable result of staffing to an org chart instead of to the work.

The first mistake is hiring research before product. A research-first first hire produces analysis, not shipped value, and you burn your runway learning things that a thin production prototype would have taught you in a week. Research has its place, but it is rarely the place you start when you are building an AI team to move a business number.

The second mistake is hiring juniors to save money. The apparent savings are real on the spreadsheet and illusory in practice, because a junior on an AI team produces output that a senior now has to evaluate, and the evaluation is the expensive part. You have not saved money; you have added review load and buried risk where it will surface in front of a customer. I made the full case for this in senior versus junior AI engineers.

The third mistake is treating evals as a phase-two concern. Teams ship first and promise to add evaluation "once we have something." Then they have something, it breaks in a way they cannot reproduce, and they have no harness to catch it. Evals are not a maturity milestone. They are how you know whether you have built anything at all.

The fourth mistake is over-hiring on early momentum. The first prototype works, excitement is high, so you hire a team for the roadmap you imagine rather than the work you have validated. Then the direction shifts, as it always does, and you are carrying headcount built for a plan that no longer exists. Stay lean longer than feels comfortable, and let the validated work pull the next hire, not the other way around.

A roadmap you can map to your stage

Every company is different, so I resist a one-size org chart. But the sequence is stable enough that I can describe it as stages. Find the row that matches where you are, and the focus column tells you what the next hire is for.

Stage	Roles in place	Primary focus
0 — Validate	1 senior generalist (often augmented)	Ship a thin prototype to a real user; decide if AI is even the right tool
1 — Prove	+ evaluation owner	Build the eval suite and the deploy gate; make quality measurable
2 — Deepen	+ domain expert	Catch domain-specific failures; make the system correct, not just plausible
3 — Surface	+ product/UX owner	Make the capability something users trust and want to use
4 — Scale	Small senior core + augmented depth	Raise judgment density and output per person; resist headcount growth

The roadmap is a guide, not a checklist to race through. Most companies should sit in stages zero and one far longer than they want to, because that is where the cheap learning is. The full playbook for hiring against this shape, what good looks like at each role and what it costs, is in my book Building an AI-Native Team: Hiring for judgment, not throughput. If you read one thing after this, read that.

Frequently asked questions

Who should be the first hire when building an AI team? A senior generalist who can own an outcome end to end: take an ambiguous problem, decide whether AI is the right tool, ship a thin version a real user touches, and tell you honestly whether it worked. Not a researcher, not a junior, and not a narrow specialist. The first hire sets the quality floor for everyone after, so you hire for judgment and ownership, not for the framework keywords on a resume.

How do you build an AI team from scratch without overspending? Start augmented rather than committing permanent headcount to a bet you have not validated. A senior team you can stand up in days lets you prove or kill the idea before you carry the cost of a full-time search and equity grant. Stay lean through validation, let the proven work pull each next hire, and resist the urge to staff the roadmap you imagine instead of the work you have done.

In-house or augmented for an AI team? At the start, almost always augmented, mainly for reversibility rather than cost: the direction will change in three months, and you want to stay nimble while you learn. The right end state for most companies is a small in-house core that owns strategy and the most defensible work, supported by augmented depth for the rest. Treat it as a ratio you tune, not a binary you decide once.

How big should an AI team be? Smaller than you think, and you scale by raising judgment density rather than headcount. One senior engineer who can architect and evaluate is worth several production-oriented people in trusted output, and adding people who cannot evaluate their own work just adds review burden. Keep the team flat and senior-heavy, and measure throughput of trusted outcomes per person rather than the size of the org chart.

If you are building an AI team and want to start with senior people on your problem in 48 hours instead of a quarter of recruiting, that is exactly what we do at Devlyn.

What Is an AI Engineer? The Role, Explained by a Hirer

Alpesh Nakrani — Thu, 23 Apr 2026 18:30:00 GMT

What is an AI engineer? Someone who builds production AI features on foundation models. Here is the role, what they do, and when you need one.

An AI engineer is the person who turns a foundation model into a feature your customers can actually use. They do not train models from scratch and they do not write research papers. They take models that already exist, wrap them in retrieval, tool calls, evals, and product logic, and ship something that holds up when a real user hits it at 2am. That is the whole job in one sentence, and almost everything written about the role buries it under a list of frameworks.

I am writing this from the hiring side of the table. I started as an engineer and now run conversion and revenue, which means I sit in two seats at once: I read the traces, and I read the P&L. At Devlyn I have hired more than 80 senior AI engineers and shipped over 200 products on top of them. So when someone asks me what an AI engineer is, I am not answering from a course syllabus. I am answering from the cost of getting the definition wrong, because every company that misunderstands this role hires the wrong person for it.

So if you came here asking what is an AI engineer, the short answer is the one above, but the useful answer is about the boundaries of the AI engineer role. The title is new, the field moves weekly, and three different jobs all have "AI" or "ML" in them. If you hire an AI engineer expecting a researcher, or a data scientist expecting an AI engineer, you pay a senior salary for a skill set that does not match the work.

This piece is the definition I wish more hiring managers had before they opened a req. It answers what an AI engineer is, what they do, how the role differs from the adjacent ones, and when your company actually needs to hire for it.

An AI engineer builds, they do not invent. Their output is a working production feature on top of an existing model, not a new model or a published result.
The role is defined by what ships, not by tools. The model names, the vector database, the orchestration library are all learnable. Production judgment is the scarce part.
It is a distinct role from ML engineer and data scientist. Different inputs, different outputs, different failure modes. Confusing them is the most expensive hiring mistake in this category.
The market is real and priced accordingly. Median base pay for the role sits around $134K in the US, and the good ones cost more because they are rare.
Not every company needs one yet. You need an AI engineer when an AI feature is going in front of customers and the cost of it being wrong is real.

What an AI engineer actually is

The term is recent. Shawn Wang coined "AI Engineer" in a 2023 essay called The Rise of the AI Engineer, and his framing still holds up. His argument was that foundation models had pushed a huge amount of capability out of the research lab and into the hands of ordinary software engineers. Tasks that used to need a research team and five years now need API docs and a focused afternoon.

His sharpest line is the one most hiring managers miss: "One can be quite successful in this role without ever training anything." That is the load-bearing distinction. An AI engineer can have a long, successful career and never once train a model, tune a loss function, or touch a GPU cluster. They consume models the way a backend engineer consumes a database. The model is an input to their work, not the work itself.

The supply math explains why the AI engineer role exists at all. Wang estimated something like five thousand people on the planet who can genuinely train frontier models, against fifty million software engineers who could learn to build products on top of them. The bottleneck was never going to be researchers. It was always the engineers who could take what the researchers built and make it survive contact with a customer, and that second group is what we now call AI engineers.

So the honest definition is functional, not technical. An AI engineer is a software engineer whose specialty is shipping reliable features powered by foundation models. The specialty shows up in the things that break: hallucination, latency, cost per call, prompt fragility, tool-call failures, and the gap between a demo that works once and a feature that works ten thousand times a day. If you want the deeper skills breakdown, I wrote a full piece on the skills that actually matter, and you can go there once you have the shape of the role.

What an AI engineer actually does day to day

Ask ten people what an AI engineer does and you will get ten lists of tools. That is the wrong altitude. The day-to-day is better described by the problems they own, because the tools change every quarter and the problems do not.

An AI engineer spends their day deciding what the model should and should not be asked to do, then building the scaffolding around it so the answer is right often enough to ship. That means designing prompts and context, wiring up retrieval so the model has the right facts, adding tool calls so it can act instead of just talk, and building the evals that tell you whether any of it is working. Then they watch it in production and fix the failures the evals did not catch.

Here is a concrete, NDA-safe version. A team I worked with had a support assistant that looked great in the demo and fell apart in the wild. The AI engineer's first week was not spent swapping models; it was spent reading two hundred real transcripts and sorting the failures into buckets. The discovery was that 60% of the bad answers came from one retrieval gap: the model could not see the customer's order history, so it guessed.

The fix was plumbing, not intelligence. That is the job, and the model was never the problem.

A second example, also illustrative. Another build kept blowing its latency budget because every request hit the largest available model. The AI engineer routed the easy 80% of requests to a smaller, cheaper model and reserved the expensive one for the hard tail. Cost dropped, speed improved, and quality held, because most requests never needed the big model in the first place.

Knowing which requests are which is the kind of judgment you are paying for. A lot of that work lives in agentic workflows, where the model is not just answering but taking steps.

If you are early in scoping a hire and want this in a format you can hand to recruiting, our team put the role into a usable spec on the AI application engineer page, which describes exactly the production work I am talking about here.

A responsibility-to-reality table

The reason "AI engineer" sounds vague is that job descriptions list responsibilities at a level of abstraction that tells you nothing. Here is the same set of responsibilities translated into what they actually look like on a Tuesday.

Responsibility	What it looks like day to day
Model integration	Calling a foundation model via API, handling timeouts, retries, and rate limits like any other unreliable dependency
Retrieval and context	Building the pipeline that finds the right documents and feeds them to the model so it answers from facts, not guesses
Prompt and output design	Shaping inputs and enforcing structured outputs so downstream code can trust what comes back
Evaluation	Building a frozen test set, scoring each change against it, and refusing to ship on vibes
Cost and latency control	Routing requests to the cheapest model that clears the bar; watching cost per resolved task, not cost per call
Production observability	Reading traces, catching the failures evals missed, and closing the loop with a fix

Notice that none of these rows say "train a model." That row belongs to a different job, which is the source of most of the confusion, so let me draw the line clearly.

The skills that actually matter

The skills that get listed on job posts are the easy ones to name and the least predictive of success. Yes, an AI engineer should know the common orchestration libraries, vector databases, and model APIs. But those are table stakes, and they are learnable by any competent engineer in a few weeks. They are not what separates a good hire from an expensive mistake.

The scarce skill is production judgment: the ability to look at a model output and know whether it is correct, why it is wrong when it is wrong, and what to change before it touches a customer. I have watched two engineers with identical resumes diverge completely on this. One ships a feature, sees it pass the demo, and moves on. The other ships the same feature, distrusts the demo, builds an eval set, and finds the 8% failure rate that would have cost the company a week of firefighting.

Same tools, different judgment. Only one of them is worth a senior salary.

The model names, the vector database, the orchestration library are all learnable in weeks. Production judgment is the part you cannot teach in an onboarding doc, and it is the part you are actually paying for.

This is why evaluation skill is the truest signal in the whole role. An engineer who treats LLM evaluation as a first-class part of the work, not an afterthought, is an engineer who understands that the model will be confidently wrong sometimes and that catching it is their job, not the user's. When I interview for this role, how a candidate talks about measuring quality tells me more than anything on their resume.

AI engineer vs ML engineer vs data scientist, at a glance

These three roles get used interchangeably, and the cost of confusing them is real. They take different inputs, produce different outputs, and fail in different ways. Here is the distinction in one view.

Role	Primary output	Works mostly with	Hire them when
AI engineer	A production feature built on an existing model	Foundation model APIs, retrieval, evals, product code	You are shipping an AI feature to users
ML engineer	A trained or fine-tuned model and its pipeline	Datasets, training loops, GPUs, model architecture	You need a custom model the market does not sell
Data scientist	An analysis, insight, or prediction	Statistics, notebooks, dashboards, experiments	You need to understand your data and answer questions

The shortest way to remember it: a data scientist tells you what the data says, an ML engineer builds a model when you cannot buy one, and an AI engineer ships the feature on top of a model someone already built. Most companies in 2026 need the third role far more often than the first two, because foundation models removed the need to train your own for the majority of product use cases. If you are still deciding which one you actually need, our definitive guide to hiring AI engineers walks through the call.

When your company actually needs an AI engineer

Here is the part the career listicles skip, because it only matters if you are the one writing the check. Not every company needs a dedicated AI engineer yet, and hiring one too early is its own kind of waste.

You need an AI engineer when three things are true at once. First, you are putting an AI-powered feature in front of real users, not running an internal experiment. Second, the cost of that feature being wrong is real, whether that is a bad customer experience, a compliance risk, or money. Third, the feature is core enough that you cannot afford to have it half-work indefinitely.

When all three hold, the production judgment of an AI engineer is the thing standing between you and a feature that embarrasses you in front of a customer.

You probably do not need one yet if your AI usage is internal tooling, occasional, and low-stakes, the kind of thing a strong generalist engineer can bolt on with an API key. The mistake I see most often runs the other way, though: a company ships an AI feature with a generalist, it works in the demo, it fails quietly in production, and nobody on the team has the eval discipline to notice until support tickets pile up. By then the fix is more expensive than the right hire would have been. The timing question deserves its own treatment, which is why I wrote about when to hire an AI engineer separately.

There is also a cost reality to be honest about. The median base for this role sits around $134,000 in the US per Coursera's salary data citing Glassdoor, and the genuinely good ones command more because the supply is thin. That is not a number to flinch at if the feature matters; it is a number to plan around. I broke down the full picture, including the senior premium, in what an AI engineer actually costs.

How to hire one without getting burned

Once you know you need the role, hiring well comes down to testing for the scarce thing, not the learnable thing. Most hiring processes for this role test framework recall, which is exactly backward. A candidate who can name every orchestration library and cannot tell you how they would catch a hallucination in production is the wrong hire wearing the right keywords.

Test for judgment instead. Show a candidate a real model output that is subtly wrong and watch whether they catch it and explain why. Ask how they would build the eval set for a feature, and listen for whether they reach for real production traffic or for a number that makes them feel good. The answers separate the operators from the resume-keyword matchers fast.

I put the full interview approach, including the questions that actually predict success, into a job description and hiring framework you can adapt.

If you would rather not run that gauntlet yourself, that is precisely the gap our team fills. Hiring senior AI application engineers who have already proven this judgment in production is what Devlyn does, so you skip the part where you learn the failure modes by hiring wrong first. And if you are building the wider team around the role, the playbook is in my book Building an AI-Native Team.

Frequently asked questions

What is an AI engineer in simple terms? An AI engineer is a software engineer who builds production features on top of existing foundation models. They do not train models from scratch. They take models that already exist and wrap them in retrieval, tool calls, evals, and product logic so the result is reliable enough to put in front of real users.

What does an AI engineer do day to day? They design what the model is asked to do, build the retrieval and tool-calling scaffolding around it, write evals that measure whether it works, and then watch it in production and fix the failures. The work is mostly engineering plumbing and judgment, not model math. Reading real traces and catching the cases the demo hid is a large part of the job.

What is the difference between an AI engineer and an ML engineer? An ML engineer trains and fine-tunes models, working with datasets, training loops, and GPUs. An AI engineer ships products on top of models that already exist, working with APIs, retrieval, and evals. You hire an ML engineer when you need a custom model the market does not sell, and an AI engineer when you are putting an AI feature in front of users.

Do you need a PhD to be an AI engineer? No. The role was defined specifically by the fact that you can succeed in it without ever training a model or holding a research degree. What matters is software engineering ability plus production judgment about where AI features break. The scarce skill is knowing whether an output is right and why it is wrong, not the academic credential.

If you want the full hiring playbook, start with the definitive guide to hiring AI engineers. And if you would rather hire a senior AI engineer who has already shipped production features and proven the judgment this article describes, that is exactly what the Devlyn team does. Define the role correctly, then hire for the part you cannot teach.

AI Engineer vs ML Engineer: What Actually Differs

Alpesh Nakrani — Wed, 22 Apr 2026 18:30:00 GMT

An AI engineer wires existing models into a product; an ML engineer builds and trains the model. Here is the real difference, and who to hire when.

The short version of the ai engineer vs ml engineer question is this: an AI engineer takes models that already exist and wires them into a working product, while an ML engineer builds and trains the model itself. If you are shipping a feature on top of GPT, Claude, or an open-weight model, you need an AI engineer first. If you have proprietary data and a prediction problem no off-the-shelf model solves, you need an ML engineer. Most companies in 2026 are in the first camp and hire as if they were in the second, which is one of the more expensive mistakes I watch teams make.

I have sat on both sides of this. I started as an engineer wiring models into products before they were called AI products, and I now run revenue at Devlyn, where I have signed off on the budget for the wrong hire and watched it cost a quarter. The title that sounds more impressive is rarely the title your problem actually needs.

This piece is about telling the two roles apart so you hire for the problem in front of you, not the org chart you imagine you will have in three years. It sits under my broader guide to hiring AI engineers.

The title that sounds more impressive is rarely the title your problem actually needs.

Key takeaways

If you read nothing else, these are the load-bearing claims:

The core difference between an AI engineer and an ML engineer is build vs wire. ML engineers train models from data; AI engineers integrate existing models into products.
The field split because foundation models ate the middle. Once GPT and Claude were good enough to use directly, "use the model" became a full job distinct from "build the model."
Most production AI work in 2026 is integration, not training. RAG, evals, latency, and the seams between model and software are AI-engineer territory, and that is where most value ships.
Hire for the problem, not the resume. An ML engineer on a RAG wiring job is overqualified and bored; an AI engineer asked to train a custom model from scratch will stall. Both mismatches are expensive.
The titles blur, so interview for the actual work. Job postings use the labels interchangeably; the only reliable signal is what the person has actually shipped.

The core distinction: wiring models in versus building them

The cleanest way to hold the distinction is by where the model comes from. A machine learning engineer's job centers on producing a model: gathering and cleaning data, choosing an architecture, training, tuning, and deploying something that did not exist before. An AI engineer's job centers on producing a product feature using a model that already exists, almost always a large foundation model accessed through an API or self-hosted.

Towards Data Science draws the same line cleanly: an AI engineer "specialises in the use and integration of foundational GenAI models such as Claude, GPT, BERT, and others" and does not build those models, while a machine learning engineer "builds algorithms from scratch that focus on more specific tasks" (Towards Data Science). That is the whole distinction in one sentence: ML engineers create models, AI engineers deploy them.

This split is recent, and it is worth understanding why it happened. For most of the 2010s there was no meaningful difference, because if you wanted AI in your product you trained the model yourself. There was no "use the model" job, because the models you would have used did not exist or were not good enough. Then foundation models arrived and got good enough to use directly for a huge range of tasks, and the act of using one became substantial enough to be a full role.

So foundation models ate the middle of the field. The work of building a custom classifier for a narrow task, which used to be a large fraction of applied ML, got absorbed by a prompt to a general model that was already better than what most teams would have trained. What was left split into two ends: the deep model-building work that genuinely needs custom training, and the product-integration work of getting a great general model to behave reliably inside a real application. The first end is ML engineering; the second is AI engineering.

What an AI engineer actually does day to day

An AI engineer spends most of their time on the seams between a model and a product. The model is a given; the hard part is everything around it. That means prompt design and prompt management, retrieval-augmented generation so the model answers from your data instead of its training set, tool and function calling, and the orchestration logic that chains several model calls into a workflow.

It also means the unglamorous reliability work that decides whether an AI feature survives contact with real users. Latency budgets, because a model that is correct but takes six seconds will lose you conversions. Fallbacks for when the API rate-limits or returns garbage, and cost control, because token spend compounds quietly until it shows up in a margin review. And evals, because you cannot ship an AI feature responsibly without a way to measure whether it is getting better or worse, which I cover in my guide to LLM evaluation.

Notice what is not on that list: training a model. An AI engineer rarely touches a training loop. They might fine-tune occasionally, but their core competency is software engineering applied to a probabilistic component they did not build. The best AI engineers are strong software engineers first who understand model behavior well enough to design around its failure modes.

If you want the full picture of that skill set, I went deep on it in the skills that actually matter for an AI engineer.

What an ML engineer actually does day to day

An ML engineer spends most of their time producing and maintaining models. That starts with data: sourcing it, cleaning it, building the pipelines that feed training, and labeling or supervising the labeling of it. Bad data is the most common reason a model underperforms, so a real ML engineer is as much a data engineer as a modeler.

Then comes the modeling itself: choosing an architecture, training, hyperparameter tuning, evaluating against held-out sets, and iterating until the metrics clear the bar. This is genuinely different work from prompt engineering. It requires comfort with the math of optimization, an intuition for why a loss curve is misbehaving, and the patience to run experiments that take hours or days to come back.

The third pillar is MLOps: getting a trained model into production and keeping it healthy. Versioning models and datasets, monitoring for drift as the world changes under the model, retraining on a schedule, and managing the serving infrastructure. An ML engineer's product is a model that keeps performing in production, not a feature a user clicks.

One honest caveat: many people with "ML engineer" on their badge today actually do AI-engineering work, because that is where the demand went. The title lagged the job. Which brings us to the part that confuses everyone.

Skills and tools that diverge

The clearest way to separate the roles in an interview is to look at the stack each one lives in, because the tools rarely lie even when the titles do.

An AI engineer's stack is mostly application software with a model-shaped hole in it: an orchestration framework or a hand-rolled equivalent, a vector database for retrieval, an evals harness, an observability layer for tracing model calls and tracking cost, and the normal apparatus of a production web service. Their KPIs are product KPIs: task resolution rate, latency at the 95th percentile, cost per resolved request, user satisfaction.

An ML engineer's stack is mostly training and data infrastructure: a deep-learning framework, experiment tracking, a feature store, data pipeline tooling, and GPU orchestration for training jobs. Their KPIs are model KPIs: accuracy, precision and recall on a held-out set, training cost and time, and how slowly the model degrades in production before it needs a retrain.

There is real overlap in the middle. Both roles need solid software engineering, both need to understand evaluation deeply, and both increasingly need to reason about inference cost. The economics of running models matters to both, which is why I argue the case for shipping smaller models regardless of which role owns the deployment. But the center of gravity is different, and that difference is what you are hiring for.

A comparison you can paste into a hiring doc

Here is the same distinction in one table, organized by the dimensions that actually matter when you are writing a job description or scoping a hire.

Dimension	AI engineer	ML engineer
Core job	Wire an existing model into a product	Build and train a model from data
Where the model comes from	Foundation model via API or self-hosted	Trained in-house on your data
Primary skills	Software engineering, RAG, prompting, evals, orchestration	Statistics, model architecture, training, data pipelines, MLOps
Typical stack	Vector DB, evals harness, observability, app framework	Deep-learning framework, experiment tracking, feature store, GPU orchestration
KPIs	Task resolution, p95 latency, cost per request, satisfaction	Accuracy, precision/recall, training cost, drift rate
Touches a training loop?	Rarely	Constantly
Hire when	You ship a feature on top of a general model	You have proprietary data and a prediction problem off-the-shelf models miss

Why the titles blur, and the overlap that confuses everyone

If the distinction is so clean, why does every job board treat the labels as synonyms? Three reasons, and they compound.

First, the titles lag the work. "ML engineer" was the established title for years, so when the job shifted toward integration, the title stuck around even as the day-to-day changed. A lot of people called ML engineers in 2026 are doing what I would call AI engineering, and many great AI engineers came up through ML-engineer roles. The badge tells you the career path, not the current job.

Second, there is genuine overlap at the senior end. A staff-level person on a real AI product often spans both: they will fine-tune a model when the evals justify it, and they will wire it into the product, and they will own the observability. The roles are distinct at the median and blurry at the top, which is normal for any maturing discipline.

Third, the data scientist confusion bleeds in. Data scientists, ML engineers, and AI engineers form a spectrum from analysis to model-building to product-integration, and companies draw the boundaries differently. The practical consequence is that you cannot trust a title on a resume. You have to interview for the actual work the person has shipped, which is the only signal that survives the labeling chaos.

The roles are distinct at the median and blurry at the top, which is normal for any maturing discipline.

Which one to hire for which problem

This is the question that actually has money attached, so let me be direct. Ask what produces the value in your product. If the value comes from a general capability that a foundation model already has, and your job is to make it reliable, grounded, fast, and safe inside your application, hire an AI engineer. That is the majority of AI products being built right now.

If the value comes from a pattern that lives only in your proprietary data, something no general model has seen, and you need a model trained specifically on it, hire an ML engineer. Demand forecasting on your specific supply chain, fraud detection on your transaction patterns, a recommendation model on your catalog and behavior data. These are real ML-engineering problems and an AI engineer will not solve them well.

Let me make the mismatch concrete with a couple of NDA-safe composites, the kind of situation I have watched play out more than once. A team raises a round, decides AI is strategic, and hires a strong ML engineer with a PhD and a publication record, while the actual roadmap is a support assistant built on RAG over their help docs. The ML engineer is overqualified for the wiring and under-equipped for the product-integration reliability work, gets bored within two months, and the assistant ships late because nobody on the team has done evals on a generative system before. The hire was excellent and wrong.

The reverse happens too. A team hires a capable AI engineer and then hands them a genuine modeling problem, a custom pricing model on proprietary transaction data, because "they do AI." The AI engineer reaches for a foundation model, the foundation model has never seen the data distribution, the results are mediocre, and three months go by before anyone admits the problem needed training, not prompting. Same lesson from the other direction: the role has to match the problem, not the buzzword.

If you are early and can afford exactly one hire, the default in 2026 is an AI engineer, because most first AI features are integration problems. Bring in ML-engineering depth when you have confirmed a modeling problem that off-the-shelf models genuinely cannot solve. If you want help pressure-testing which one your roadmap actually needs, the Devlyn team scopes this with you before you write the job description.

Cost and availability in 2026

Both roles are expensive and both are scarce, but the markets behave differently. ML engineers command strong salaries because the supply of people who can genuinely train and ship models is limited and has been for a decade. Built In puts the average US machine learning engineer base salary at $162,080 in 2026, with additional cash compensation around $49,942 and total compensation averaging roughly $212,022 (Built In). Senior and specialized ML talent runs well above that.

AI engineering is a younger market and noisier. The title is newer, so the candidate pool is a mix of genuine integration specialists and software engineers who added "AI" to their profile after shipping one feature. Compensation overlaps heavily with ML engineering at the senior end, and the variance is wider because the skill bar is harder to read from a resume. That variance is exactly why interviewing for shipped work matters more here than for almost any other role.

The availability story has a strategic angle. Because most teams default to chasing ML-engineer prestige hires, the integration talent that most products actually need is sometimes easier to find and slightly less contested, if you know to look for it specifically. For a fuller breakdown of the numbers and the total cost of getting this hire wrong, I wrote up what an AI engineer actually costs separately.

The deeper point is that the cost that hurts is not the salary. It is the wrong hire: a quarter lost to a mismatched skill set, a feature that shipped late or not at all, and a strong engineer who leaves because the work was not what they signed up for. That cost dwarfs the salary delta between the two roles, which is why getting the role-to-problem match right is the single most important decision in the whole process.

The cost that hurts is not the salary. It is the wrong hire on the wrong problem.

How you think about this hire is part of a larger shift in what engineering organizations look like now, which is the subject of my book Building an AI-Native Team. The roles, the seams between them, and the judgment to put the right person on the right problem are the whole game.

Frequently asked questions

What is the main difference between an AI engineer and an ML engineer? An ML engineer builds and trains models from data; an AI engineer integrates existing models, usually large foundation models, into a working product. ML engineers create models, AI engineers deploy them. If your value comes from a general capability a foundation model already has, you need an AI engineer; if it comes from a pattern only in your proprietary data, you need an ML engineer.

Is an AI engineer the same as a machine learning engineer? No, though the titles are used loosely and the work overlaps at the senior end. The reliable test is what the person has shipped: a trained-from-scratch model in production points to ML engineering, while a reliable feature built on a foundation model with RAG and evals points to AI engineering. Interview for the actual work, not the label on the resume.

Which should I hire first for an AI feature? In 2026, an AI engineer is the right default, because most first AI features are integration problems solved with an existing model rather than custom-training problems. Bring in ML-engineering depth once you have confirmed a modeling problem that off-the-shelf models genuinely cannot solve. Hiring the wrong one first is one of the most expensive mistakes in the process.

Do AI engineers need to know how to train models? Not deeply. An AI engineer should understand model behavior well enough to design around its failure modes and to know when a problem actually requires training, but their core competency is software engineering applied to a probabilistic component they did not build. Training from scratch is ML-engineering territory.

If you would rather have a team scope the role to your actual roadmap and place the right engineer on the right problem, that is exactly what Devlyn's engineer placement is for. Hire for the problem in front of you. Ignore the title that sounds impressive.

AI Engineer vs Data Scientist: Who to Hire When

Alpesh Nakrani — Tue, 21 Apr 2026 18:30:00 GMT

An AI engineer ships AI features into your product; a data scientist extracts insight and builds the models behind decisions. Here is which one to hire when.

The core difference in the AI engineer vs data scientist question is what they are paid to produce: an AI engineer ships AI features into your product, and a data scientist extracts insight from your data and builds the models behind a decision. So hire the AI engineer when you need an AI capability live in front of users next quarter, and hire the data scientist when you need to understand why a number moved or whether a bet is worth making. One builds the thing customers touch. The other builds the understanding the business runs on.

I have hired into both seats, managed both, and watched both go wrong, usually because someone hired the title they had heard of rather than the work they actually needed. I have also sat in the seat where a wrong answer became a customer-service problem in a physical store, which is a fast way to learn that "data person" is not a job description. This piece is the employer-side version of the comparison: not what each role is in the abstract, but which one to put on which problem, and what it costs you when you swap them. It sits under my broader take on hiring AI engineers for judgment, not a resume, which is the pillar this argument hangs off.

If you already know you need the modeling-and-insight seat and want it filled by people who ship decisions rather than notebooks, that is exactly the work Devlyn's data scientists do. The rest of this is how to know which seat you are actually hiring for.

Key takeaway: An AI engineer ships AI features into the product; a data scientist extracts insight and builds the models behind decisions. Hire to ship, or hire to learn.
The titles overlap; the deliverables do not. Both touch models and data. One ends in a shipped feature with latency and error budgets; the other ends in an answer, a model, or a recommendation a human acts on.
Most org-chart pain is a mismatch. Hiring a data scientist to ship a feature, or an AI engineer to find out why retention dropped, burns a quarter before anyone admits the seat was wrong.
You usually need one first, not both. Start with whichever your nearest problem demands, and let the second hire be pulled by a problem you can name, not by a template org chart.
The overlap is real where model quality lives. Evaluation, data quality, and "is this model good enough" is the seam both roles work, and it is where they collaborate instead of compete.

The core distinction in one sentence

A data scientist turns data into understanding; an AI engineer turns a model into a product. That is the whole thing, and almost every confusion downstream comes from forgetting it.

A data scientist's output is an answer or a model that informs a decision: whether a pricing change will lift revenue or just shift it, which customers are about to churn and what they have in common, whether the lift you saw is real or noise. The deliverable is insight a human acts on, or a model that scores something, and it is judged on whether it is correct, defensible, and useful to the decision-maker. The work is statistical and experimental at its core.

An AI engineer's output is a working feature with users, latency, and an error budget: a support assistant that answers from your docs, a recommendation surface that loads in under two seconds, a pipeline that extracts structured data from messy intake forms reliably enough to trust. The deliverable is software that runs in production, judged on whether it ships, holds up under real traffic, and fails gracefully when the model is wrong. The work is engineering at its core, with the model as one component among many.

The reason the AI engineer vs data scientist line blurs is that both touch models and both touch data. But notice the tense and the surface: the scientist's work mostly informs something a human will do, while the engineer's work is the thing the user does. That difference, decision support versus product surface, is the one that should drive who you hire.

AI engineer vs data scientist: what each does day to day

Abstractions hide the difference, so here is the concrete version. The clearest signal of which role you need is which of these two ticket lists looks like the work you have.

A data scientist's week looks like: pulling and cleaning data from three systems that disagree with each other, framing a fuzzy business question as something measurable, running an experiment or a regression, building or retraining a churn or forecasting model, and then, the part that actually earns the salary, explaining to a non-technical stakeholder what the result means and what it does not mean. A lot of the job is saying "the data cannot answer that yet, here is what we would need." It is closer to applied science with a deadline than to software development.

An AI engineer's week looks like: wiring an LLM or a model behind an API, building the retrieval and prompt scaffolding around it, handling streaming, timeouts, and the case where the model returns garbage, adding the auth, permissions, and logging that make a feature shippable, and watching production dashboards for the latency spike or the cost blowout. The model is rarely the hard part. The hard part is everything around the model that turns a demo into something a customer can rely on. It is software engineering where one dependency happens to be probabilistic.

Here is the test I use. If the deliverable can live in a deck, a dashboard, or a notebook and still create value, you are describing a data scientist. If the deliverable only creates value once it is running in your product, behind a login, in front of a user, you are describing an AI engineer. The seam in the middle, "build a model and also ship it as a feature," is where teams convince themselves one person does both, and where they usually find out otherwise.

If the deliverable can live in a deck, a dashboard, or a notebook and still create value, you want a data scientist. If it only creates value running in your product in front of a user, you want an AI engineer.

Skills and tools, side by side

The skill sets overlap at the edges and diverge hard at the center. Both are comfortable with Python and with the idea of a model. After that they pull apart, and the divergence tells you what each one will be good at when the pressure is on.

A data scientist's center of gravity is statistics and inference: experimental design, hypothesis testing, knowing when a result is significant and when it is a story you are telling yourself, feature engineering, and the modeling libraries that go with that, the pandas-and-notebooks-and-SQL world, plus enough visualization to make a finding land with someone who will not read the code. The deep skill is framing and skepticism: turning a vague question into a measurable one and refusing to over-claim the answer. I cover the engineering side of that judgment in more depth in the AI engineer skills breakdown.

An AI engineer's center of gravity is production software: APIs, data stores, queues and background jobs, observability, the deployment and release machinery, and the specific craft of building around a model that is non-deterministic, retrieval, prompt and context construction, fallbacks, evaluation of model output, and cost and latency control. The deep skill is systems judgment: knowing what breaks at scale and designing so it fails safely. The model is a component; the system is the job.

A comparison table you can paste into a hiring doc

Here is the same split in one place. When you are writing the role or arguing for a headcount, this is the difference between AI engineer and data scientist in the terms a hiring decision actually turns on.

Dimension	Data scientist	AI engineer
Primary output	Insight, a model, a recommendation	A shipped feature running in production
Judged on	Correctness and usefulness of the answer	Whether it ships and holds up under traffic
Core skill	Statistics, experiment design, framing	Production software, systems design
Where it lives	Deck, dashboard, notebook, model file	Your product, behind a login
Failure looks like	A wrong or over-claimed conclusion	An outage, a latency spike, a cost blowout
Hire to	Learn something / decide something	Ship an AI capability to users
Nearest question	"Why did this happen / what should we do?"	"How do we put this in front of users?"

The O-NET occupational profile defines a data scientist as someone who develops techniques to "transform raw data into meaningful information" (O*NET 15-2051.00), which is a precise way of saying the deliverable is meaning, not a deployed system. Hold that next to the AI engineer row and the divergence is the whole point.

Where the two roles overlap, and where org charts get confused

The overlap is real and it sits in one place: model quality. The question "is this model good enough to rely on" is owned by both, from different angles, the data scientist asking it statistically (is the model accurate and unbiased on the population it scores) and the AI engineer asking it operationally (is it good enough in production once latency, cost, and edge cases are priced in). When both seats exist, this is where they collaborate well, and it is why LLM evaluation tends to be the most productive shared territory rather than a turf fight.

The confusion is also real, and it is expensive. The most common mistake I see is hiring for the title and assigning the other role's work: a team hires a brilliant data scientist and then asks them to ship a production AI feature, and three months later there is an excellent model wrapped in a fragile demo that cannot survive real traffic, because productionizing was never their craft. The mirror version is just as common, a team hiring a strong AI engineer and then asking them to explain why retention dropped last quarter, getting a beautiful dashboard from someone who may not have the statistical training to say whether the pattern in it is signal or noise.

Neither person failed. They were put in the wrong seat. The cost is rarely a single bad sprint; it is a quarter or two of building the wrong thing well, plus the morale hit when a strong hire is quietly judged for not being good at a job they were never hired to do. This is the same root cause behind a lot of vetting failures: the interview tested the title instead of the work.

Neither person failed. They were put in the wrong seat, and the cost is not a bad sprint. It is a quarter or two of building the wrong thing well.

There is a third confusion worth naming: the "full-stack data person" myth. Yes, some people genuinely do both, usually senior generalists who came up before the roles split. But staffing a real product on the assumption that one hire covers both is a bet against the odds. You are far more likely to get someone strong on one side and passable on the other, and "passable" on production engineering is how outages happen.

Data scientist vs AI engineer: which to hire for which problem

Here is the decision, stripped to the problem you actually have. Read these as "if your nearest pain is X, hire Y," and the title sorts itself out.

Hire an AI engineer when the goal is a shipped capability: an AI feature in your app, an LLM-powered workflow your users will touch, a model integrated into production with the reliability, latency, and cost controls that implies. If your sentence ends in "...and put it in front of users," this is your seat. When that is the work, Devlyn's AI application engineers build the full feature slice, not just the model call. This is also the more likely first hire for a product company, which is why when to hire an AI engineer is its own decision.

Hire a data scientist when the goal is understanding or a model that informs a decision: why a metric moved, whether an experiment worked, which customers to target, what a forecast says, whether a bet is worth making. If your sentence ends in "...so we know what to do," this is your seat. The deliverable is a defensible answer, and the value is the decision it unlocks, which is exactly the work Devlyn's data scientists are built for.

A short illustrative example of each, both kept deliberately generic. A retail team wanted a "smart product finder" live for the holiday season; that is an AI engineer problem, because the value only exists once it is running in the store experience at speed. A different team kept guessing why repeat purchases were sliding and arguing about it in meetings; that is a data scientist problem, because the value is a defensible answer that ends the argument. Same word, "AI," wildly different hire.

Do you actually need both?

Eventually, maybe. Right now, probably not, and forcing both early is a common way to burn runway on a role you cannot yet keep busy. The honest answer for most teams under a certain size is: hire the one your nearest, most expensive problem demands, and let the second hire be pulled by a real problem you can name, not by a template org chart that says serious AI teams have both.

The sequencing usually goes like this. If you are a product company trying to put AI in front of users, your first hire is almost always the AI engineer, because shipping is the bottleneck and a half-built feature earns nothing. If you are a business drowning in data and decisions, the data scientist comes first, because you cannot ship the right thing until you know what the right thing is. The second role gets added when the first one keeps hitting a wall that is clearly the other discipline, the engineer who keeps needing rigorous model evaluation they are not trained to do, or the scientist whose excellent models keep dying on the way to production.

One caveat I will own: at real scale, with multiple AI products and a serious data estate, you want both as standing functions, and trying to make one cover both becomes the bottleneck. The point is not "never hire both." It is "do not hire both before a named problem requires it." If you want the longer argument for how these seats fit into a team that ships with AI rather than around it, that is the subject of my book, Building an AI-Native Team. And the budget side of the decision, what each seat actually costs, is its own conversation I work through in what an AI engineer costs.

For a sense of scale on the engineering side, the Stack Overflow 2024 Developer Survey put the US median pay for self-identified AI developers at around the high-$100Ks (Stack Overflow 2024); data scientist comp lands in a broadly similar band, which means the choice between them is almost never about saving money. It is about matching the seat to the work, because that is where the real money is won or lost.

Frequently asked questions

What is the main difference between an AI engineer and a data scientist? An AI engineer ships AI features into your product as working software, judged on whether they run reliably in production. A data scientist extracts insight from data and builds the models behind decisions, judged on whether the answer is correct and useful. One builds what users touch; the other builds what the business understands.

Should I hire an AI engineer or a data scientist first? Hire whichever your nearest expensive problem demands. If the goal is putting an AI capability in front of users, hire the AI engineer first, because shipping is the bottleneck; if the goal is understanding why something is happening or what to do, hire the data scientist first. Add the second only when the first keeps hitting a wall that is clearly the other discipline.

Can one person do both AI engineering and data science? Some senior generalists genuinely can, but staffing a real product on that assumption is a bet against the odds. You are far more likely to get someone strong on one side and passable on the other, and "passable" on production engineering is how outages happen. Hire for the side your work centers on, and treat dual-strength as a bonus, not a plan.

Do AI engineers and data scientists earn similar salaries? Broadly, yes, both land in a comparable senior-engineer band, with the AI engineer median in the US reported around the high-$100Ks in the Stack Overflow 2024 survey and data scientist comp in a similar range. Because the pay is close, the decision between them is about fit to the work, not cost, which is the whole reason getting the seat right matters so much.

If you can already name which seat your problem needs, the next step is filling it with people who ship rather than prototype. Devlyn's data scientists turn models into decisions, and Devlyn's AI application engineers turn models into features your users can rely on. Match the seat to the work, and the hire stops being a gamble.

AI Engineer vs Software Engineer: The Real Difference

Alpesh Nakrani — Mon, 20 Apr 2026 18:30:00 GMT

AI engineer vs software engineer: one builds deterministic systems you can test, the other builds probabilistic systems you have to evaluate. Who to hire when.

The core of ai engineer vs software engineer comes down to one property: a software engineer builds deterministic systems where the same input always produces the same output, and an AI engineer builds probabilistic systems on top of foundation models where the same input can produce a different output every time. If your problem has a correct answer you can specify in advance, hire a software engineer. If your problem requires judging an open-ended output that no test can fully pin down, you need an AI engineer. Most teams in 2026 reach for the wrong one because the two roles look identical on a resume and behave nothing alike in production.

I am writing this from the hiring side. I started as an engineer wiring systems together, and I now run revenue at Devlyn, which means I sit in two seats at once: I read the traces and I read the P&L. I have hired more than 80 senior AI engineers and shipped over 200 products on top of them. The most expensive mistake I watch teams make is treating an AI engineer as a software engineer who also knows some machine learning, then being surprised when the deterministic instincts that made the person great break down the first time a model says something confidently wrong.

This piece is about telling the two apart so you hire for the problem in front of you. It sits under my broader guide to hiring AI engineers. If you already have an AI feature stalling in your stack and want people who have shipped this before, Devlyn's AI application engineers do exactly this work.

A software engineer proves the system is correct. An AI engineer proves the system is correct often enough, and knows what to do about the rest.

Key takeaways

If you read nothing else, these are the load-bearing claims:

The core distinction is deterministic vs probabilistic. Software engineering produces the same output for the same input; AI engineering does not, and that single fact reshapes the whole job.
The dividing skill is evaluation, not coding. A software engineer proves correctness with tests; an AI engineer proves it with evals and judgment, because you cannot unit-test a language model.
Most skills transfer; the dangerous ones do not. System design, APIs, and observability carry over cleanly. The instinct that "it passes, so it works" is the thing that gets a senior software engineer in trouble.
Hire by problem, not by title. A correct-answer problem wants a software engineer; an open-ended, model-driven problem wants an AI engineer.
Yes, a software engineer can become an AI engineer. The hard part is a mindset shift toward probabilistic thinking, not a stack of new frameworks.

The core difference: deterministic software vs probabilistic AI

Traditional software engineering is the practice of building deterministic systems. Given the same input, the code returns the same output, every time, forever. That property is the foundation of everything a software engineer relies on: you can write a test that asserts a specific result, run it on every commit, and trust that a green check means the behavior is correct. Correctness is a thing you specify, then verify.

AI engineering breaks that foundation in one specific place. The system still has a lot of ordinary deterministic software around it, but a load-bearing component in the middle is a foundation model, and foundation models are probabilistic. The same prompt can produce a different answer on two consecutive calls, and even with temperature pinned to zero there is no hard guarantee of identical output. AI engineering is, in Chip Huyen's framing, the process of building applications with foundation models, and the discipline exists precisely because that probabilistic core demands a different kind of engineering around it.

This is not a small technical footnote, because it changes what "done" means. A software engineer ships when the tests pass. An AI engineer cannot ship on a passing test, because there is no single correct output to assert against. They ship when the system is right often enough, on the cases that matter, measured against a held-out set they trust.

What a software engineer actually does

A software engineer designs, builds, and maintains systems with explicit logic. They translate requirements into code that does a specific, knowable thing: process a payment, render a page, validate a form, move a record through a state machine. The languages and frameworks vary, but the underlying contract is constant. The system has defined inputs, defined outputs, and a specification that says which is which.

The craft of software engineering is largely about managing that determinism at scale: architecture that stays maintainable as the codebase grows, tests that catch regressions, observability that surfaces failures, and abstractions that let a team move without stepping on each other. When something breaks, there is a root cause, and a good engineer can trace it. The bug is in the code, the config, or the data, and it is, in principle, findable and fixable.

That is the world a senior software engineer has mastered. It is a world where correctness is provable and failure is explainable. None of that goes away in AI work. It just stops being sufficient.

What an AI engineer actually does

An AI engineer takes a foundation model that already exists and turns it into a feature a customer can use. They do not train models from scratch, and they are not research scientists. They wrap a model in retrieval, tool calls, guardrails, evals, and product logic, and they make the whole probabilistic thing behave well enough to ship. The closest sibling role is the one I cover in what an AI engineer is, and the role they get confused with most often is the one in AI engineer vs ML engineer.

The daily work looks different from software engineering in a way that surprises people. A large share of an AI engineer's time goes into looking at outputs, not writing code. They read what the model produced, decide whether it was acceptable, figure out why it failed when it failed, and adjust the prompt, the retrieval, or the guardrail accordingly. The same instinct shows up in agentic coding work, where an engineer supervises a model that writes code rather than writing all of it themselves.

The hardest part is that there is no compiler for judgment. When a model returns an answer, nothing tells you it is wrong. The engineer has to know it is wrong, which means they need enough domain understanding to evaluate the output and enough discipline to do it systematically rather than by vibe. That evaluation work is the actual job, and it is the part that does not show up in a job description full of framework names.

A large share of an AI engineer's day is spent reading model outputs and deciding whether they are good enough. That is not overhead. That is the work.

Skills that transfer, and skills that don't

The good news for any software engineer eyeing this move is that most of what makes you a strong engineer carries over cleanly. System design transfers, API integration transfers, and so do observability, testing discipline, version control, handling failure gracefully, and writing code other people can maintain. An AI feature is still mostly software, and a probabilistic core does not excuse you from building the deterministic parts well. A weak engineer does not become a strong AI engineer by learning prompt syntax.

The dangerous part is the skill that does not transfer, and it is an instinct rather than a technique. A senior software engineer carries a deep, mostly unconscious belief that if the code passes its tests, the system works, and in AI work that belief is a trap. A model output can pass every check you wrote and still be wrong on the case you did not think to write, because the failure space is open-ended rather than enumerable. The engineers who struggle most in the transition are often the most senior ones, precisely because their hard-won deterministic instincts fire confidently in a domain where those instincts mislead.

So the skills to actually add are narrower than the tutorials suggest. You need to understand how foundation models behave, including how they fail, and you need to think in distributions and probabilities rather than single cases. Above all you need to build and trust evals as your primary instrument of correctness, the way you once trusted unit tests. The rest, retrieval, tool calls, the specific orchestration libraries, is learnable in weeks, and I go deeper on the full set in AI engineer skills.

Can a software engineer become an AI engineer?

Yes, and it is one of the most common transitions in the field right now. The demand is real: AI engineering roles are projected to grow 26 percent between 2023 and 2033, against a 4 percent average for all occupations, on Bureau of Labor Statistics figures. A strong software engineer already holds most of the foundation, which is why the jump is faster than the jump into, say, ML research.

But I want to be honest about what the jump actually requires, because the roadmap content tends to oversell the tooling and undersell the rework. The hard part is not learning a vector database or a model API. The hard part is unlearning the assumption that correctness is something you specify once and verify forever. The engineers who make the transition well are the ones who get comfortable saying "this is right 94 percent of the time and here is what happens on the other 6 percent" instead of needing a green check before they feel safe.

In practice the fastest path is to take a real feature, ship it on a foundation model, and force yourself to build the eval that proves it works. The first time you watch a model pass your tests and fail a customer, the mindset shift happens on its own. That single experience teaches more than any course, because it puts the probabilistic reality in front of you in a way no tutorial can.

AI engineer vs software engineer: a comparison you can paste into a planning doc

Here is the software engineer vs ai engineer distinction in one table, organized around the parts that actually change how the work feels day to day.

Dimension	Software Engineer	AI Engineer
System type	Deterministic: same input, same output	Probabilistic: same input, output can vary
Core material	Explicit logic and code they write	Foundation models they wire into a product
How correctness is proven	Unit and integration tests	Evals against a held-out set, plus judgment
"Done" means	Tests pass	Right often enough on the cases that matter
Main daily activity	Writing and maintaining code	Reading outputs, diagnosing failures, tuning
Failure mode	A traceable bug with a root cause	A fluent, confident, wrong answer
Hire when	The problem has a specifiable correct answer	The problem is open-ended and model-driven

Which one should you hire?

Hire by the shape of the problem, not the title that sounds more current. If what you are building has a correct answer you can specify, a payment flow, a dashboard, an integration, a CRUD app with real users, you want a software engineer, and bolting an AI title onto the req will just cost you more for a worse fit. Plenty of "AI" features are actually deterministic software with a thin model call, and a good software engineer ships them faster than a specialist.

You want an AI engineer when the value of the product lives in a model's open-ended output, and getting that output to behave is the actual hard part. A support assistant that has to be right in front of a frustrated customer, a retrieval system that has to ground its answers, an agent that takes actions, all of these fail in the probabilistic ways that a software engineer's toolkit does not address. The signal is simple: if your biggest risk is "the model says something wrong and we don't catch it," you need someone whose entire discipline is built around catching it. That discipline is the same one I describe in my guide to LLM evaluation.

One illustrative pattern I have seen more than once: a team hires a brilliant senior software engineer for an AI feature, the engineer ships something that demos beautifully, and three weeks later support tickets pile up because the model fails on inputs nobody tested. The engineer was not bad; the match was. The fix was not a better software engineer; it was someone who would have built the eval before the demo. If you are weighing that decision now, Devlyn's AI application engineers are people who default to the eval, because they have paid the cost of skipping it.

Frequently asked questions

What is the main difference between an AI engineer and a software engineer?

A software engineer builds deterministic systems where the same input always yields the same output, so correctness is proven with tests. An AI engineer builds probabilistic systems on top of foundation models, where the same input can yield different outputs, so correctness is proven with evals and human judgment. Everything else about the role difference follows from that one fact.

Do AI engineers and software engineers use the same skills?

They share most of them. System design, API work, observability, and testing discipline transfer directly, because an AI feature is still mostly ordinary software. The skills an AI engineer adds are understanding how foundation models fail, thinking in probabilities rather than single cases, and building evals as the primary measure of correctness. The one instinct that does not transfer is the belief that passing tests means the system works.

Can a software engineer become an AI engineer?

Yes, and it is a common and increasingly fast transition. A strong software engineer already holds most of the foundation, so the work is less about new frameworks and more about a mindset shift toward probabilistic thinking. The fastest path is to ship a real feature on a foundation model and build the eval that proves it works, because the first time a model passes your tests and fails a customer, the shift happens on its own.

Should I hire an AI engineer or a software engineer for my product?

Hire by the problem. If the feature has a specifiable correct answer, a software engineer is the right and cheaper choice, even if there is a model call somewhere in it. If the product's value lives in an open-ended model output and the main risk is the model being confidently wrong, hire an AI engineer whose discipline is built around catching exactly that.

If you want the full hiring picture, including roles, costs, and how the bad hires fail, start with my pillar guide to hiring AI engineers, and the team-building side is in my book Building the AI-Native Team. And if you would rather skip the misfit hire entirely and bring in people who have shipped probabilistic systems to real customers, that is exactly what Devlyn's AI application engineers are for. Hire for the problem, not the title.

What Is an LLM Engineer? The Role, Explained for Hirers

Alpesh Nakrani — Sun, 19 Apr 2026 18:30:00 GMT

What is an LLM engineer? The specialist who turns foundation models into reliable production features. Here is the role, what they do, and when to hire.

An LLM engineer is the person who takes a large language model someone else trained and turns it into a feature your customers can rely on. They do not build the model. They build everything around it: the prompts that constrain it, the retrieval that grounds it, the evals that catch it when it drifts, and the serving layer that keeps it fast and affordable under real load. That is the role in one sentence, and most of what gets written about it buries that sentence under a pile of framework names.

I am writing this from the hiring side of the table. I started as an engineer and now run conversion and revenue, which means I read the traces and I read the P&L in the same week. At Devlyn I have hired specialists for exactly this work and shipped products on top of them. So when someone asks me what an LLM engineer is, I am not answering from a job-board template. I am answering from the cost of getting the definition wrong, because the wrong definition is how a company pays a senior salary for the wrong skill set.

The title is new, it overlaps with two other roles that have "AI" or "ML" in them, and the field changes weekly. Hire an LLM engineer expecting a research scientist, or a data scientist expecting an LLM engineer, and the work does not get done while the money is already spent. This piece is the definition I wish more hiring managers had before they opened the req: what the role is, what it does all day, how it differs from the adjacent roles, and when your company actually needs one. If you already know you need the role and just want people who have shipped it, Devlyn places LLM engineers vetted on production judgment rather than a tool list.

An LLM engineer builds on models, they do not train them. Their output is a working production feature on top of an existing foundation model, not a new model and not a paper.
The job is defined by what ships, not by tools. The model names, the vector store, the orchestration library are all learnable in a week. Production judgment is the scarce part.
It is a distinct role from AI engineer and ML engineer. Narrower than "AI engineer," and almost the opposite of "ML engineer." Confusing the three is the most expensive hiring mistake in this category.
The core work is prompting, retrieval, evals, fine-tuning, and serving. Most of the value lives in retrieval and evals, not in clever prompts.
You do not need one until an LLM feature goes in front of customers and the cost of it being wrong is real. Before that, you are hiring ahead of the problem.

What an LLM engineer actually is

Start with what the model already gives you and what it does not. A foundation model arrives knowing a great deal about language and almost nothing about your business, your data, or your customers. It will answer confidently when it should refuse, it will invent facts it was never given, and it will cost you money on every token whether the answer was right or not. None of those problems are solved by picking a bigger model. They are solved by engineering.

That engineering is the job. An LLM engineer wraps the model in the machinery that makes it safe to put in front of a paying customer: instructions that constrain its behavior, a retrieval layer that feeds it your real data, a set of evals that measure whether it is getting better or worse, and an inference path tuned for latency and cost. The model is an input to that system. The system is the product, and the LLM engineer is the person who owns it.

This is why the role is defined by what ships and not by which tools the person has touched. I have interviewed candidates who could recite every orchestration framework released in the last year and could not tell me how they would know their RAG system had quietly started returning stale documents. I have hired others who had used only two libraries and immediately asked what the failure cost was if the model was wrong in front of a customer. The second kind ships products that hold up. If you want the longer version of that argument, the LLM engineers we place at Devlyn are selected on exactly that judgment, not on a tool checklist.

What an LLM engineer does all day

The day-to-day breaks into five kinds of work, and a good LLM engineer moves between all of them. None of them is the glamorous "talk to the AI" part that the title implies.

Prompting and context engineering. This is writing the instructions and assembling the context the model sees on every call. It sounds soft and it is not. A production prompt is an interface contract: it defines what the model is allowed to do, what it must refuse, and what shape the output takes so the rest of the system can parse it. The skill is in constraining behavior and handling the failure cases, not in clever phrasing.

Retrieval, or RAG. Most LLM products are retrieval products wearing a chat interface. Retrieval-augmented generation is, in NVIDIA's words, "a technique for enhancing the accuracy and reliability of generative AI models with information fetched from specific and relevant data sources" (NVIDIA). The engineer chunks your documents, indexes them in a vector store, retrieves the relevant ones at query time, and feeds them to the model. When RAG goes wrong, it goes wrong quietly: the retrieval returns the wrong passage and the model answers fluently from it. Catching that is half the job, and it is the half that deciding between RAG and fine-tuning turns on.

Evals. An eval is a test suite for a system that does not give the same answer twice. The engineer builds a frozen set of representative inputs, defines what a good output looks like, and scores every model and prompt change against it. Faithfulness is a typical eval metric: RAGAS defines it as the ratio of claims in an answer that are actually supported by the retrieved context (Ragas docs). Without evals, "the model got better" is a vibe. With them, it is a number you can defend. This is the discipline I cover in depth in my guide to LLM evaluation.

Fine-tuning, when it earns its place. Fine-tuning adapts a pre-trained model to a narrower task on your data. It is powerful and it is overused. A strong LLM engineer reaches for retrieval and better prompts first and fine-tunes only when the eval numbers say the cheaper options have run out. Knowing the difference is judgment, not a default.

Serving and cost. The feature has to be fast and affordable at the volume you actually run. The engineer manages latency, caching, and the inference bill, because a feature that is correct but slow or unaffordable does not ship. Prompt caching is one of the levers here, and it matters more than most teams realize once traffic climbs.

Most LLM products are retrieval products wearing a chat interface. The clever prompt gets the demo. The retrieval and the evals get the renewal.

The responsibilities, in one table

Here is the same work as a table, so you can match a job description or a candidate against it. The left column is the responsibility. The right column is what it actually looks like when the person is doing it well, not the abstract version.

Responsibility	What it looks like in practice
Prompt & context engineering	Writes prompts as interface contracts: constrains behavior, defines output shape, handles refusals and edge cases
Retrieval (RAG)	Chunks and indexes your data, tunes retrieval quality, catches stale or wrong-passage answers before customers do
Evaluation	Builds a frozen eval set from real traffic, scores every change, reports faithfulness and accuracy as numbers, not vibes
Fine-tuning	Reaches for it only when evals show prompts and retrieval have run out; owns the data and the regression risk
Serving & cost	Manages latency at p95, caching, and the inference bill so the feature is affordable at real volume
Production judgment	Knows the cost of being wrong, decides what "good enough" means, and which 5% of cases will hurt you

LLM engineer vs AI engineer vs ML engineer

This is the question that costs the most when it goes unanswered, so here is the clean version. An ML engineer builds and trains models. Their world is data pipelines, training runs, feature engineering, and model architecture. They produce a model. An AI engineer builds products on top of models someone else trained; I have written the full definition in what an AI engineer is. An LLM engineer is an AI engineer whose models are specifically large language models, and whose daily work centers on prompting, retrieval, and evals rather than, say, computer vision or recommendation systems.

So the relationship is: LLM engineer is a specialization inside AI engineer, and both are nearly the opposite of ML engineer on the build-versus-train axis. An ML engineer who has never shipped a RAG system and an LLM engineer who has never trained a model from scratch are both doing their jobs correctly. They are not interchangeable, and a job post that lists "train LLMs from scratch" alongside "build our chatbot" is describing two different hires. The AI engineer versus ML engineer split covers the build-versus-train distinction in more depth.

I watched a team burn most of a quarter on this. They posted for an "LLM engineer" but the req was written by someone who pictured a research scientist, full of training-from-scratch language. The strong application builders self-selected out, the candidates who stayed oversold their research credentials, and the person they hired spent three months trying to justify a fine-tuning project the product never needed. The fix was one rewrite of the job description to describe building, not inventing. The cost was the quarter.

The skills that actually matter

The technical floor is real and most competent candidates clear it: fluency with LLM APIs, RAG, eval design, prompt and context engineering, and enough software engineering to ship a system that does not fall over. That floor is necessary and it is not what separates a good hire from an expensive one. The skill that separates them is judgment, and it is the same judgment that runs through the AI engineer skills that actually matter.

Judgment shows up as three questions the good ones ask without prompting. When is the model wrong, and how would I know? When is "good enough" actually good enough to ship? Which slice of cases, usually a small one, will hurt the business if it fails? A candidate who frames their work around the cost of being wrong has the judgment. A candidate who frames it around the model they used does not, no matter how impressive the model.

This is also why seniority is not about years. A junior engineer can know every framework and still not know which 5% of cases matter. A senior one has been burned enough times to instrument for them first. If that distinction is load-bearing for your req, the senior-versus-junior breakdown is the one to read before you set the level.

When you actually need an LLM engineer

You need an LLM engineer when an LLM-powered feature is going in front of customers and the cost of it being wrong is real. That is the whole test. Before that point, you are hiring ahead of the problem and paying a scarce salary to a person who does not yet have a production problem to solve.

The signal is not "we want to use AI." Everyone wants to use AI. The signal is that you have a specific feature, a real user touching it, and a concrete cost when the model misbehaves: a refund, a compliance exposure, a support ticket, a churned account. When the wrongness has a price tag, you need someone whose entire job is to drive that price down. When it does not yet, a strong generalist engineer with API access will get you to the demo, and you hire the specialist when the demo becomes a product.

I have also watched the opposite mistake, which is cheaper but still real. A team waited too long, ran a customer-facing LLM feature on a generalist's spare time for two quarters, and accumulated a backlog of quiet failures nobody owned because nobody owned evals. The feature looked fine in every demo and leaked trust in production. The day they hired someone to own it, the first month was just measuring how wrong it already was. Hire when the cost is real, not before, and not a year after.

How to hire one

Once you know you need the role, hiring it well is its own discipline, and it is the subject of my definitive guide to hiring AI engineers. The short version: write the req around what ships, not around tools; interview for judgment by asking how the candidate would know the system was wrong; and check that their stated pay expectations match the role you are actually filling rather than the inflated market headline. On that last point, what an AI engineer actually costs is more reliable than the salary aggregators, which vary wildly because they lump three different roles under one title.

If you would rather not run that gauntlet, that is a legitimate choice. Hiring a permanent specialist for a role this new and this fast-moving is a real bet, and a lot of teams are better served by bringing in proven LLM engineers on an engagement and converting later. That is exactly what Devlyn's LLM engineer hiring is for: people who have already shipped retrieval, evals, and serving in production, vetted on judgment rather than a tool list. The framework I use to think about building the team around them is in Building an AI-Native Team.

Frequently asked questions

What is an LLM engineer in simple terms? An LLM engineer takes a large language model someone else trained and turns it into a reliable product feature. They build the prompts, the retrieval, the evals, and the serving layer around the model. They do not train the model itself, and they do not write research papers. Their output is a working feature, not a new model.

What is the difference between an LLM engineer and an AI engineer? An LLM engineer is a specialization within the AI engineer role. Both build products on top of models someone else trained, but the LLM engineer works specifically with large language models and spends their days on prompting, retrieval, and evals. An AI engineer is the broader title that can also cover vision, recommendation, or other model types.

Do LLM engineers need to know machine learning? They need to understand how LLMs behave well enough to debug and constrain them, but they do not need to train models from scratch, which is the ML engineer's job. The most valuable LLM engineering skills are retrieval, eval design, prompt engineering, and the software engineering to ship a system. Deep model-training expertise is a bonus, not a requirement.

When should I hire an LLM engineer? When an LLM-powered feature is going in front of customers and the cost of it being wrong is real, such as a refund, a compliance risk, or a churned account. Before that, a strong generalist engineer with API access can get you to a working demo. You hire the specialist when the demo becomes a product people depend on.

If you have a feature in front of customers and the wrongness now has a price tag, that is the moment the role pays for itself. Devlyn places LLM engineers who have already done this work in production, and the hiring guide walks through running the search yourself. Build on the model. Measure what breaks. Hire the person who does both.

How to Hire an LLM Engineer (and What to Look For)

Alpesh Nakrani — Sat, 18 Apr 2026 18:30:00 GMT

How and where to hire an LLM engineer, the signals to screen for, what it costs, and when to hire through a partner instead of building the loop yourself.

To hire an LLM engineer who will actually ship, screen for someone who treats evals, retrieval debugging, and cost-per-task as first-class work, not afterthoughts, and source them through specialist networks or a partner that pre-vets for production experience rather than a general job board. The fastest path when you cannot vet the candidate yourself is to hire through a partner who can put a pre-vetted senior LLM engineer in front of you in days, not the four-to-five months the open market currently takes.

I have sat on both sides of this. I started as an engineer, and I now run revenue at Devlyn, where I hire and deploy LLM engineers into products that touch paying customers. So I will skip the recruiter platitudes and tell you what separates an LLM engineer who turns a demo into a margin-positive feature from one who burns six months and a quarter-million dollars on a chatbot nobody trusts. This is the LLM-specialist deep dive under my broader guide to hiring AI engineers.

Key takeaway: An LLM engineer is an applied-systems hire, not a research hire. Screen for production judgment, RAG and tool-use debugging, and eval discipline, not model trivia or benchmark scores.
The interview should contain an eval. If your loop is a LeetCode round and a culture chat, you are screening for the wrong job. Give them a messy retrieval failure and watch how they reason.
Cost tracks scarcity, not hype. Senior LLM specialists run roughly $240K-$350K+ base in the US, and the demand-to-supply ratio is about 3.2 to 1, which is why time-to-hire on the open market is months.
The build-vs-partner decision hinges on one question: can you vet this person yourself? If you cannot, hiring through a pre-vetting partner is faster and cheaper than a wrong full-time hire.
The most expensive mistake is hiring the resume instead of the failure mode you cannot tolerate. Define the job by what must not break, then hire against that.

What an LLM engineer actually brings (and how it differs from a general AI engineer)

An LLM engineer builds reliable systems on top of language models. That is the whole job, and the word doing the work is "reliable." The hard part of this role was never calling an API; any competent developer can get a model to respond. The hard part is making it respond correctly, fast enough, and cheaply enough, on the long tail of inputs real users send, every time.

This is where the title gets muddy, so let me be precise. A general AI engineer or ML engineer often comes from a training-and-modeling background: datasets, gradients, model architecture. An LLM engineer works one layer up, in the applied-systems layer, where the model is a fixed component and the engineering is everything around it. If you want the broader taxonomy, I wrote it up in what an AI engineer is and the skills that matter; the short version is that an LLM engineer is the specialist who owns the behavior of the model in production.

Concretely, the work is retrieval pipelines that surface the right context, prompts that hold up under adversarial input, tool calling and structured outputs that downstream code can trust, evals that catch regressions before customers do, and the cost and latency controls that keep the feature affordable at scale. None of that shows up on a benchmark leaderboard. All of it shows up in your support queue when it is done badly.

I have learned to distrust candidates who lead with which models they have used. The model is the least durable part of the stack; it will be replaced twice before the feature is a year old. The durable skill is the system thinking around it.

The skills and signals to screen for

The skill that predicts success in this role better than any other is evals-first thinking. An LLM engineer who reaches for an evaluation set before they reach for a bigger model has internalized the only discipline that makes language-model work tractable. If they cannot tell you how they would measure whether the feature is good, they cannot build a feature that is good, no matter how fluent the demo looks.

The second signal is failure-mode literacy. Ask a candidate what breaks in a RAG system and a strong one will not say "hallucination" and stop. They will walk you through retrieval missing the relevant chunk, the model ignoring retrieved context, chunk boundaries splitting a key fact, and stale embeddings, and they will tell you how they would isolate which one is firing. That diagnostic instinct is the difference between someone who debugs and someone who reruns the prompt and hopes.

The third signal is cost and latency awareness as a product concern, not an afterthought. A real LLM engineer knows that a feature which is 2% more accurate and 600 milliseconds slower at the 95th percentile can lose more revenue than it earns. They think about caching, routing cheap requests to small models, and what a resolution actually costs, because they have shipped something that had to pay for itself.

The fourth signal is simply that they ship. Plenty of people can talk about agents and retrieval beautifully and have never put a language-model feature in front of a user who could leave a bad review. Production experience changes how someone thinks, because production is where you learn that the boring failures, a malformed JSON output at 2 a.m., are the ones that actually hurt. For the full screening playbook, see how to vet AI engineers and the interview questions I lean on.

The model is the least durable part of the stack. Hire for the system thinking around it, not the model name on the resume.

A signal-by-signal screening table you can run

Here is how I turn those signals into an interview. For each one, there is something concrete to test and a clear tell that separates a strong answer from a weak one. Paste this into your hiring doc and run it.

Signal	What to test	Strong vs weak
Evals-first thinking	Give a vague feature ("answer billing questions"); ask how they would know it works	Strong: defines a frozen, production-sampled set and failure modes first. Weak: jumps to model choice or "we'd test it."
Retrieval debugging	Show a RAG answer that is fluent but wrong; ask what they check	Strong: isolates retrieval miss vs context-ignored vs stale index. Weak: blames "hallucination" and swaps the model.
Cost and latency judgment	Ask how they would cut inference cost 50% without hurting quality	Strong: caching, routing, task narrowing, smaller models on the easy tail. Weak: "use a cheaper model everywhere."
Structured output and tool use	Ask how they guarantee downstream code can trust the model output	Strong: schema validation, retries, guardrails, graceful failure. Weak: assumes the model returns clean JSON.
Production scar tissue	"Tell me about an LLM feature that broke in production"	Strong: a specific boring failure and the fix that stuck. Weak: only demo or benchmark stories.
Model-agnostic thinking	Ask what changes in their system if the model is swapped next quarter	Strong: very little, because the eval and scaffolding hold. Weak: the whole thing is tuned to one model's quirks.

The pattern across every row is the same. A strong LLM engineer treats the model as a replaceable input to a system they own; a weak one treats the model as the system. You are hiring for the first kind.

Where to find LLM engineers (and how to vet them)

The supply problem is real, so where you look matters. The strongest applied LLM engineers are rarely scanning general job boards; they are employed, building, and reachable through specialist communities, open-source contributions to eval and retrieval tooling, technical writing, and referrals from people who have shipped with them. A candidate who has published a thoughtful post-mortem on a RAG system going wrong is worth ten who list "LLMs" as a skill.

Wherever you source them, the vetting bar is the same, and it is not a LeetCode loop. Algorithmic puzzles tell you nothing about whether someone can debug a retrieval pipeline or design an eval. The single highest-signal screen is a small, paid take-home built around a realistic failure: here is a retrieval system that returns plausible-but-wrong answers, find out why and propose a fix. How they reason through that tells you more than any whiteboard round.

I watched a team nearly pass on a quiet candidate who fumbled the systems-design trivia, then ace the take-home by writing an eval harness before touching the prompt and catching that the index was chunking mid-sentence. They hired him. He turned out to be the best LLM engineer on the team, precisely because his instinct was to measure before he guessed. The trivia round would have screened him out; the eval-shaped exercise screened him in.

The mirror-image story is the candidate who dazzled in the interview, name-dropped every framework, and shipped a feature that fell apart on real traffic because he had never written a single test against production-sampled inputs. Both stories are composites, but the lesson is not: vet for the discipline, not the vocabulary.

What it costs to hire an LLM engineer

Compensation for this role is high because the talent is genuinely scarce, not because of hype. As of 2026, senior AI engineer base salaries in the US run roughly $180K-$280K, and LLM and generative-AI specialists command a premium on top of that, landing around $240K-$350K+ at the senior level according to the kore1 AI engineer salary guide. That premium is the market pricing the gap between a general engineer and one who can make a language model behave in production. I break the full picture down in what an AI engineer costs.

The scarcity behind those numbers is structural. Across the market there are roughly 3.2 open AI roles for every qualified candidate, and NLP/LLM specialists are rated among the most acute shortages, with demand growing fast year over year, per secondtalent's global AI talent shortage data. That same data puts the global average time-to-hire for these roles near 4.7 months. If you are planning a roadmap around a hire you have not started, that lead time is the number that should worry you.

The cost that gets ignored is the cost of getting it wrong. A failed senior technical hire is commonly estimated at 1.5x to 3x annual salary once you count ramp time, severance, the opportunity cost of the unbuilt roadmap, and the rehire. For a $250K LLM role, that is a $375K to $750K mistake, and it is far more likely when you cannot evaluate the person you are hiring. The expensive part of hiring is not the salary; it is the wrong salary.

One honest caveat on every number here: ranges vary widely by market, level, and how you define the role, and the figures above are external benchmarks, not a quote for your specific hire. Treat them as a frame for the order of magnitude, not a price list.

In-house vs hiring through a partner

The build-vs-partner decision is not about cost first; it is about your ability to vet and the time you have. Hiring a full-time LLM engineer into your own org is the right move when LLM work is core and recurring, when you can credibly evaluate the candidate, and when you can afford to wait months to fill the seat. If all three are true, hire in-house and own the capability. I lay out that trade in detail in in-house vs outsourced AI and when to hire at all.

The case for hiring through a partner gets strong the moment one of those conditions fails. If you cannot confidently vet an LLM engineer yourself, you are making a $250K-plus bet on a skill set you cannot assess, and a partner who has already done the vetting absorbs that risk. If you need someone shipping in weeks rather than months, a pre-vetting partner skips the four-to-five-month open-market search. And if the work is real but not yet a permanent headcount, an embedded specialist lets you move now without committing to a hire you might not need in a year.

This is the gap Devlyn is built to close. If you would rather not run a five-month search and a vetting loop you are not equipped to run, Devlyn can put a pre-vetted senior LLM engineer in front of you in 48 hours, screened for exactly the signals in the table above: RAG, tool use, structured outputs, evals, tracing, routing, and cost controls. You keep the option to convert to full-time once you have seen the work, which is a far safer way to make a senior hire than a resume and three interviews.

The honest version of this advice is that a partner is not always the answer. If LLM work is your core product surface for the next five years and you have the judgment to hire well, building the team yourself is the better long-term play. The partner route wins on speed, vetting risk, and optionality, which is exactly what most teams making their first LLM hire are short on.

The common mistakes hiring for this role

The mistake I see most often is hiring the resume instead of the failure mode. Teams write a job description that lists every fashionable acronym and then interview for keyword coverage, when they should start from the question "what must this feature never get wrong?" and hire the person whose instincts are organized around preventing exactly that. Define the job by the failure you cannot tolerate, and the screening writes itself.

The second mistake is an interview loop with no eval in it. If your process is two algorithm rounds and a behavioral chat, you have measured general engineering and culture and learned nothing about whether this person can make a language model reliable. The interview has to contain the actual job, which means a retrieval failure to debug or an eval to design, scored on reasoning rather than a clean answer.

The third mistake is paying frontier-model salary for API-wrapper work, or its inverse, expecting a junior to own a system that needs a senior. Match the level to the failure mode: a low-stakes internal tool does not need a $300K specialist, and a customer-facing feature where wrong answers cost real money is not a place for someone who has never shipped. I cover that calibration in senior vs junior AI engineers.

The fourth mistake is treating model fluency as the bar. A candidate who can hold forth on every model and technique but has never owned a real evaluation loop or shipped behind a cost ceiling will produce impressive demos and fragile products. Fluency is table stakes; the discipline to measure, debug, and control cost is the actual job. Even something as basic as knowing when prompt caching earns its keep tells you whether someone has felt the bill.

Frequently asked questions

How do I hire an LLM engineer if I cannot evaluate the skills myself?

Hire through a partner that pre-vets for production LLM experience, or bring in a trusted senior practitioner to run your technical screen. Making a $250K bet on a skill set you cannot assess is the single most expensive way to hire, and a pre-vetting partner exists precisely to absorb that risk. You can convert a strong embedded engineer to full-time once you have seen real work, which beats hiring on a resume and three interviews.

What is the difference between an LLM engineer and an AI engineer?

LLM engineer is the applied-systems specialist who owns the behavior of a language model in production: retrieval, prompting, tool use, evals, routing, and cost. "AI engineer" is the broader umbrella that can also include training-and-modeling work closer to data science. For most product teams hiring today, the LLM engineer is the role you actually need, because the model already exists and the work is making it reliable.

How much does it cost to hire an LLM engineer?

Senior US base salaries for LLM and generative-AI specialists run roughly $240K-$350K+ as of 2026, a premium over general engineering driven by a demand-to-supply ratio near 3.2 to 1. Embedded or partner engagements trade a monthly rate for speed and lower vetting risk. The bigger number to watch is the cost of a wrong hire, commonly 1.5x to 3x salary once you count ramp, opportunity cost, and rehire.

How long does it take to hire an LLM engineer?

On the open market, expect roughly four to five months for a senior specialist, given the structural shortage. A pre-vetting partner can compress that to days because the screening is already done; that speed is often the deciding factor when a roadmap is waiting on the seat. Either way, start sooner than feels comfortable, because the lead time is the part teams consistently underestimate.

If you want the full hiring philosophy underneath this, roles, sequencing, and how to staff for judgment rather than throughput, it is in my book Building an AI-Native Team and the pillar guide to hiring AI engineers. And if you would rather skip the search entirely, Devlyn places pre-vetted senior LLM engineers screened for everything in this article. Hire for the discipline. Ignore the demo.

How to Hire an ML Engineer (and What to Look For)

Alpesh Nakrani — Fri, 17 Apr 2026 18:30:00 GMT

How and where to hire an ML engineer, the skills and signals to screen for, what it costs, and when to hire through a partner instead of building in-house.

To hire an ML engineer who actually moves a metric, screen for someone who treats data quality, model validation, and drift monitoring as the job, not the afterthought, and source them through specialist networks or a partner that pre-vets for production experience rather than a general job board. If you cannot vet the candidate yourself, the fastest path is to hire through a partner who can put a pre-vetted senior ML engineer in front of you in days, instead of the four-to-five months an open-market search for this role currently takes.

I have sat on both sides of this table. I started as an engineer, and I now run revenue at Devlyn, where I hire and deploy ML engineers into products that touch paying customers. So I will skip the recruiter platitudes and tell you what separates an ML engineer who turns a notebook into a model that earns its keep from one who burns two quarters on something that demoed beautifully and never survived contact with live data. This is the ML-specialist deep dive under my broader guide to hiring AI engineers.

Key takeaway: An ML engineer is a data-and-modeling hire, not an applied-systems hire. Screen for validation discipline, feature engineering, and drift awareness, not algorithm trivia or Kaggle medals.
The interview must contain real, dirty data. If your loop is a LeetCode round and a culture chat, you are screening for the wrong job. Hand them a leaky dataset and watch whether they catch it.
Cost tracks scarcity, not hype. Senior ML engineers run roughly $200K-$270K total comp in the US, and the wrong hire costs far more than the right salary.
The build-vs-partner decision hinges on one question: can you vet this person yourself? If you cannot, hiring through a pre-vetting partner is faster and cheaper than a wrong full-time hire.
The most expensive mistake is hiring the resume instead of the failure mode you cannot tolerate. Define the role by what must not break, then hire against that.

What an ML engineer actually brings (vs an AI or LLM engineer)

An ML engineer builds, validates, and ships models that learn from your data. That is the whole job, and the words doing the work are "your data." The hard part of this role was never importing scikit-learn or calling fit; any bootcamp graduate can train a model that scores well on a holdout. The hard part is knowing whether that score is real, whether it will hold on next month's traffic, and whether the feature pipeline that fed it in training will feed it the same way in production.

This is where the titles blur, so let me be precise. An ML engineer works in the data-and-modeling layer: datasets, features, training, validation, and the monitoring that catches a model rotting in production. An LLM or general AI engineer often works one layer up, in the applied-systems layer, where a pretrained model is a fixed component and the engineering is everything around it. If you want the full taxonomy, I wrote it up in AI engineer vs ML engineer; the short version is that the ML engineer owns whether the model is correct, and the applied-systems engineer owns whether the product around it is reliable.

Concretely, the ML engineer's work is data pipelines that do not leak, features that generalize, validation that does not lie to you, deployment that survives real load, and drift monitoring that tells you when the world moved out from under your model. None of that shows up in a notebook accuracy cell. All of it shows up three months later when the model that scored 0.94 in training is quietly making expensive mistakes on production data.

I have learned to distrust candidates who lead with which algorithms they have used. The algorithm is the least interesting decision in most production ML; a well-validated gradient boosting model beats a poorly validated transformer almost every time. The durable skill is the judgment around the data and the discipline around the evaluation, not the model zoo on the resume.

The skills and signals to screen for

The skill that predicts success in this role better than any other is validation discipline. An ML engineer who distrusts their own holdout score, who asks how the train and test split was made before they celebrate, has internalized the only habit that keeps production ML honest. If they cannot tell you how a model that looks great offline can fail the moment it ships, they have not yet shipped one that did.

The second signal is data-leakage literacy. Ask a candidate how a model can score 0.95 in training and fall apart in production, and a strong one will not say "overfitting" and stop. They will walk you through target leakage, a feature computed with future information, train-test contamination, and training-serving skew where the pipeline computes a feature differently at inference than it did in training. That diagnostic instinct is the difference between someone who debugs a model and someone who retrains it and hopes.

The third signal is feature and pipeline thinking over algorithm worship. Real production lift in most domains comes from better features and cleaner data, not a fancier model. A candidate who reaches for feature engineering and data quality before they reach for a bigger architecture has shipped something that had to work, not just score. They treat the data pipeline as the product, because in production it is.

The fourth signal is simply that they ship and then watch. Plenty of people can train a model and hand off a notebook; far fewer have owned a model in production, watched it drift, and rebuilt the retraining loop that kept it honest. Production experience changes how someone thinks, because production is where you learn that the model is never finished, only monitored. For the full screening playbook, see how to vet AI engineers and the interview questions I lean on; the broader skill map is in the skills that actually separate the good ones.

The algorithm is the least interesting decision in production ML. Hire for the judgment around the data, not the model name on the resume.

A signal-by-signal screening table you can run

Signal	What to test	Strong vs weak
Validation discipline	Give a model with a suspiciously high holdout score; ask if they trust it	Strong: interrogates the split, checks for leakage, asks how it was sampled. Weak: celebrates the number and moves on.
Data-leakage literacy	"A model scores 0.95 offline and fails in production - why?"	Strong: target leakage, train-serving skew, contaminated split. Weak: says "overfitting" and stops.
Feature engineering	Ask how they would lift a stuck model without changing the algorithm	Strong: new features, data quality, label review. Weak: "try a deeper network."
Drift and monitoring	Ask what they watch after a model ships and what triggers a retrain	Strong: input drift, prediction drift, ground-truth lag, alert thresholds. Weak: "we'd retrain quarterly."
Reproducibility / MLOps	Ask how a teammate reproduces their result six months later	Strong: versioned data, pinned env, tracked experiments. Weak: "it's in a notebook somewhere."
Production scar tissue	"Tell me about a model that broke after it shipped"	Strong: a specific silent failure and the fix that stuck. Weak: only offline or competition stories.

The pattern across every row is the same. A strong ML engineer treats the offline score as a hypothesis to be disproven and the model as a system to be monitored; a weak one treats the offline score as the finish line. You are hiring for the first kind.

Where to find ML engineers (and how to vet them)

The supply problem is real, so where you look matters. The strongest ML engineers are rarely scanning general job boards; they are employed, building, and reachable through specialist communities, open-source contributions to data and MLOps tooling, technical writing, and referrals from people who have shipped models alongside them. A candidate who has published an honest post-mortem on a model that quietly degraded is worth ten who list "machine learning" as a skill.

Wherever you source them, the vetting bar is the same, and it is not a LeetCode loop. Algorithmic puzzles tell you nothing about whether someone can spot a leaky feature or design a validation scheme. The single highest-signal screen is a small, paid take-home built around realistic, dirty data: here is a dataset with a subtle leak and a misleading metric, build something you would actually deploy and tell me what you do not trust about it. How they reason through that beats any whiteboard round.

I once watched a team nearly pass on a quiet candidate who fumbled the algorithm trivia, then ace the take-home by refusing to report a number until she had found that a timestamp feature was leaking the label. They hired her. She turned out to be the best ML engineer on the team, precisely because her instinct was to distrust the score before she trusted it. The trivia round would have screened her out; the data-shaped exercise screened her in. The details are changed, but the lesson is not.

The mirror-image story is the candidate who dazzled in the interview, named every architecture, and shipped a churn model that looked brilliant offline and degraded within weeks because nobody had built the monitoring to notice the input distribution had shifted. Both are composites. Both point the same direction: vet for the discipline around the data, not the vocabulary around the models.

What it costs to hire an ML engineer

Compensation for this role is high because the talent is genuinely scarce, not because of hype. As of 2026, the average ML engineer in the US earns around $162K base and roughly $212K in total compensation, with senior engineers reaching about $235K base and $270K total, per the Built In salary data. At the very top, FAANG and frontier-lab packages run well past that once stock is included. Those numbers price the gap between someone who can train a model and someone who can ship one that keeps working. I break the full picture down in what an AI engineer costs.

The scarcity behind those numbers is structural and durable. The roles that feed ML engineering are among the fastest-growing in the economy: the Bureau of Labor Statistics projects data-scientist employment to grow about 36 percent and computer-and-information-research-scientist roles about 26 percent over the 2023-2033 decade, far outpacing the average occupation (R&D World, citing BLS). Demand that outruns supply by that margin is exactly why time-to-hire on the open market stretches into months for a strong ML engineer.

The cost that gets ignored is the cost of getting it wrong. A failed senior technical hire is commonly estimated at 1.5x to 3x annual salary once you count ramp time, severance, the opportunity cost of the unbuilt roadmap, and the rehire. For a $230K ML role, that is a $345K to $690K mistake, and it is far more likely when you cannot evaluate the person you are hiring. The expensive part of hiring is not the salary; it is the wrong salary attached to the wrong person.

In-house vs hiring through a partner

The build-vs-partner decision is not about cost first; it is about your ability to vet and the time you have. Hiring a full-time ML engineer into your own org is the right move when modeling work is core and recurring, when you can credibly evaluate the candidate, and when you can afford to wait months to fill the seat. If all three are true, hire in-house and own the capability. I lay out that trade in detail in the companion piece on hiring an LLM engineer, which covers the applied-systems cousin of this role.

The case for hiring through a partner gets strong the moment one of those conditions fails. If you cannot confidently vet an ML engineer yourself, you are making a $230K-plus bet on a skill set you cannot assess, and a partner who has already done the vetting absorbs that risk. If you need someone shipping in weeks rather than months, a pre-vetting partner skips the multi-month open-market search. And if the work is real but not yet a permanent headcount, an embedded specialist lets you move now without committing to a hire you might not need in a year.

This is the gap Devlyn is built to close. If you would rather not run a multi-month search and a vetting loop you are not equipped to run, Devlyn can put a pre-vetted senior ML engineer in front of you in 48 hours, screened for exactly the signals in the table above: validation discipline, feature engineering, drift monitoring, reproducibility, and production ownership. You keep the option to convert to full-time once you have seen the work, which is a far safer way to make a senior hire than a resume and three interviews.

The honest version of this advice is that a partner is not always the answer. If modeling is your core product surface for the next five years and you have the judgment to hire well, building the team yourself is the better long-term play, and my book Building an AI-Native Team is about exactly that. The partner route wins on speed, vetting risk, and optionality, which is what most teams making their first ML hire are short on.

The mistakes that sink an ML hire

The mistake I see most often is hiring the Kaggle resume instead of the failure mode. Competition skill and production skill overlap less than people assume: a leaderboard rewards squeezing the last decimal out of a fixed, clean dataset, while production rewards noticing the dataset is wrong. Start from the question "what must this model never get wrong, and how would we know?" and hire the person whose instincts are organized around answering it.

The second mistake is an interview loop with no real data in it. If your process is two algorithm rounds and a behavioral chat, you have measured general engineering and culture and learned nothing about whether this person can build a model you can trust. The interview has to contain the actual job, which means a messy dataset to validate or a suspicious metric to interrogate, scored on reasoning rather than a clean answer.

The third mistake is ignoring the operational half of the role. A model is not a deliverable; it is a system that needs monitoring, retraining, and ownership long after the launch demo. Hire someone who has lived through a model degrading silently, because they will build the drift alerts and retraining loop from day one instead of discovering they were needed after the model already cost you money. I make the broader version of this case in my piece on why the model you can operate beats the model that benchmarks best.

The fourth mistake is treating offline accuracy as the bar. A candidate who can hold forth on architectures and squeeze a high holdout score but has never owned a real evaluation loop against production-sampled data will produce impressive notebooks and fragile products. Offline accuracy is table stakes; the discipline to validate honestly, monitor in production, and catch your own leaks is the actual job.

Frequently asked questions

How do I hire an ML engineer if I cannot evaluate the skills myself?

Hire through a partner that pre-vets for production ML experience, or bring in a trusted senior practitioner to run your technical screen. Making a $230K bet on a skill set you cannot assess is the single most expensive way to hire, and a pre-vetting partner exists precisely to absorb that risk. You can convert a strong embedded engineer to full-time once you have seen real work, which beats hiring on a resume and three interviews.

What is the difference between an ML engineer and an AI or LLM engineer?

An ML engineer works in the data-and-modeling layer: datasets, features, training, validation, and drift monitoring, and owns whether the model is correct. An LLM or applied AI engineer works one layer up, treating a pretrained model as a fixed component and owning the system around it. For a team that needs custom models trained on its own data, the ML engineer is the role you actually need; for a team building on top of a foundation model, it is usually the applied engineer.

How much does it cost to hire an ML engineer?

In the US as of 2026, the average ML engineer earns around $162K base and roughly $212K total compensation, with senior engineers near $235K base and $270K total, and frontier-lab packages running higher once stock is counted. Embedded or partner engagements trade a monthly rate for speed and lower vetting risk. The bigger number to watch is the cost of a wrong hire, commonly 1.5x to 3x salary once you count ramp, opportunity cost, and rehire.

What is the single best screening signal for an ML engineer?

Whether they distrust their own offline score. The strongest ML engineers interrogate how a model could be fooling them, target leakage, training-serving skew, a contaminated split, before they trust a high number, and they build the monitoring to catch drift after it ships. A take-home around realistic, slightly-broken data surfaces that instinct faster than any whiteboard round.

If you want the broader hiring playbook this fits inside, start with my guide to hiring AI engineers and the team-design thinking in Building an AI-Native Team. And if you would rather skip the multi-month search and the vetting loop you are not equipped to run, Devlyn can put a pre-vetted senior ML engineer in front of you in 48 hours, screened for the validation and production discipline that actually predicts a model worth shipping. Hire for the judgment around the data. Ignore the medals.

How to Hire an MLOps Engineer (Without Getting Burned)

Alpesh Nakrani — Thu, 16 Apr 2026 18:30:00 GMT

Hiring an MLOps engineer is a reliability bet, not a tooling checklist. Here is what the role owns, how to vet for it, what it costs, and when you actually need one.

When you hire an MLOps engineer, you are hiring for one thing above all others: the ability to keep a model reliable in production after the demo is over. Not the longest tool list on the resume, not the most certifications, not the prettiest architecture diagram. The person you want is the one who can take a model that works in a notebook and make it deploy, monitor, roll back, and stay cheap enough to run, every day, without a human babysitting it. Everything else is learnable. That judgment is the scarce thing, and it is what separates a strong MLOps hire from an expensive one.

I have hired and deployed senior AI and ML engineers at Devlyn, and I sit in two seats at once: I read the deployment logs and I read the P&L. From that seat, the pattern is consistent. Most teams hiring their first MLOps engineer screen for the wrong things, anchor on the wrong cost, and only find out the hire was wrong when a model silently drifts in production and nobody notices until a customer does. This piece is the specialist deep-dive that branches off my pillar guide to hiring AI engineers, and it is written for the person who has already decided they need this role and wants to get it right the first time.

If you would rather not run a three-month search for a role you cannot fully vet yourself, you can buy the judgment pre-vetted. That is exactly what the Devlyn MLOps engineering team exists for: engineers who own the reliability surface, on a transparent rate, with a trial period instead of a hiring gamble. But whether you build or buy, you need to know what good looks like, so let me give you that first.

Hire for reliability ownership, not tooling breadth. The scarce skill is keeping a model healthy in production, not naming the most platforms.
MLOps is at least three jobs in one title. Platform, infrastructure, and applied MLOps look different on a resume and cost different money. Screen for the one you need.
The role is defined by what happens after deploy. Monitoring, drift detection, rollback, and cost control are where MLOps earns its salary, and where weak hires quietly fail.
The cost that matters is loaded, not the salary line. A US MLOps engineer runs roughly $130K to $200K base, and the salary is the smallest part of the true cost.
Hiring before you have anything in production is the classic mistake. If nothing ships yet, you may need an AI or ML engineer first, not an MLOps specialist.

What an MLOps engineer actually owns

An MLOps engineer owns the path a model takes from a trained artifact to a reliable production service, and everything that keeps it healthy once it is live. That is the whole job in one sentence, and the word that carries the weight is reliable. A data scientist or ML engineer can produce a model that scores well offline. The MLOps engineer is the person who makes sure that model deploys repeatably, serves at the latency and cost your product can afford, and tells you when it starts to fail before your customers do.

Concretely, the surface they own breaks into four areas. First, pipelines and reproducibility: training and data pipelines that run the same way twice, experiment tracking, a model registry, and lineage so you can answer "which data and code produced the model in production right now." Second, deployment and CI/CD for models: packaging a model, getting it behind a serving layer, automating the release, and making rollback a one-command operation rather than a fire drill.

Third, and this is the area weak hires neglect, monitoring and drift detection. A model does not throw a stack trace when it gets worse. It just quietly degrades as the world shifts away from its training distribution, and the only way you find out is if someone instrumented the inputs, the outputs, and the downstream outcomes to catch it. Fourth, cost and performance: an MLOps engineer who does not watch inference spend will hand you a model that works and a bill that does not, which is why I treat inference cost as a first-class MLOps concern, not an afterthought.

A model does not throw a stack trace when it gets worse. It quietly degrades, and the only way you find out is if someone instrumented it to catch the drift before a customer does.

If you want the standard menu of tools attached to these areas, it looks like MLflow or Weights & Biases for tracking, Airflow or a pipeline orchestrator, KServe, SageMaker, or Vertex for serving, and Kubernetes underneath most of it. But here is the thing I tell every founder: the tools are the answer to the wrong question. The right question is whether the person can own the outcome, reliability, when the tool inevitably does not do what the docs promised.

The skills and signals that separate a strong hire from a weak one

The strongest MLOps engineers I have worked with share a trait that does not appear on any certification: they think in failure modes. Ask one how they would deploy a new model and a weak candidate describes a happy path, push, serve, done; a strong one immediately starts talking about what happens when it breaks. How do we shadow-test before cutting traffic over, what is the rollback trigger, and what metric tells us the new model is worse before customers do? That instinct to design for the bad day is the single best predictor of a hire who will save you money rather than cost it.

The second signal is whether they treat monitoring as a product, not a dashboard. Anyone can stand up a Grafana board. The engineer you want connects model behavior to business outcomes, so the alert fires on "the model's decisions are drifting from what good looks like," not just "CPU is high." This is the same discipline I cover in the gap between offline and online evaluation: a model that passed every offline check can still fail online, and MLOps is the function that catches it.

The third signal is judgment about scope. MLOps is at least three different jobs wearing the same title, a platform engineer who builds the internal ML platform, an infrastructure engineer who lives in Kubernetes and serving layers, and an applied engineer who owns one product's models end to end. A strong candidate knows which of those they are and tells you honestly when a problem is outside their lane. A weak one claims all three and is excellent at none. The skills that actually matter are about depth in the lane you need, not breadth across all three.

A screening table you can run in an interview

Here is the rubric I use, distilled. For each signal, there is a test you can run in an hour and a clear read on what strong versus weak sounds like. Paste this into your interview notes and score against it.

Signal	Test	Strong	Weak
Failure-mode thinking	"Walk me through deploying a new model to production."	Leads with shadow testing, rollback triggers, and the metric that catches regression	Describes the happy path; mentions rollback only when prompted
Drift detection	"A model that passed every offline test is degrading in production. Find out why."	Instruments inputs, outputs, and downstream outcomes; reasons about distribution shift	Re-runs the offline eval and is confused when it still passes
Reproducibility	"Which data and code produced the model serving traffic right now?"	Registry, lineage, and versioned pipelines make this a one-minute answer	"I would have to check" with no system to check against
Cost ownership	"This model works but costs $40K a month to serve. What do you do?"	Profiles the spend, proposes batching, quantization, or routing, ties it to the P&L	Treats cost as someone else's problem
Scope honesty	"Which part of MLOps is your deepest lane, and which would you hand off?"	Names a specific strength and an honest gap	Claims to be expert at platform, infra, and applied all at once

None of these tests requires a take-home or a whiteboard algorithm. They require the candidate to reason out loud about production, which is the only environment that matters for this role. If you cannot run these tests confidently yourself because you do not have an MLOps background, that is a signal in itself, and we will come back to what to do about it.

Where to find and vet MLOps engineers

The sourcing channels are the usual ones: your network first, then specialist communities, then platforms. The MLOps engineers worth hiring tend to cluster around the open-source tools they use, MLflow contributors, Kubernetes operators, people active in the ML platform and serving communities. Job boards and general recruiters will send you volume; the volume will be heavy on tool-listers and light on the failure-mode thinkers you actually want.

The real problem is not finding candidates. It is vetting them. MLOps sits at the intersection of software engineering, infrastructure, and machine learning, which means a generalist interviewer can be fooled in both directions, by a strong software engineer who has never owned a model in production, and by a strong researcher who has never shipped reliable infrastructure. The screening table above is your defense, but it only works if someone on your side can tell a real answer from a confident one.

This is where most first-time hirers get burned, and it is the honest case for buying the capability pre-vetted rather than building it cold. If you cannot evaluate the candidate yourself, you are gambling on a three-month search for a role whose failure modes you cannot see. Buying pre-vetted capacity, through MLOps platform development or a dedicated engineer, moves the vetting risk off your plate and onto a team that runs this rubric for a living. I make the full build-versus-buy argument in the pillar guide; for MLOps specifically, the asymmetry is sharper because the cost of an unnoticed production failure is so high.

What an MLOps engineer costs in 2026

Let me give you the salary line first, because it is the number everyone anchors on, and then explain why it is the wrong number to anchor on. In the US in 2026, MLOps engineer base salaries run roughly $90,000 to $257,000 depending on seniority and market, with a national average in the $130K to $165K band (kore1; salary.com puts the average near $131K). Senior MLOps engineers at frontier labs and FAANG-tier employers climb past $300K in total compensation. Offshore and nearshore, the same capacity costs meaningfully less on the rate card.

But the salary line is the smallest part of the true cost, and this is the same lesson I lay out in detail on what an AI engineer actually costs. Add benefits, taxes, equipment, and tooling and a $160K base becomes a loaded cost north of $200K before the person has prevented a single outage. Then add ramp: a new MLOps engineer needs to learn your stack, your models, and your failure history before they can own reliability, and that is months at partial capacity.

The cost that actually matters is the one nobody quotes you: the cost of getting it wrong. An MLOps hire who does not instrument drift hands you a model that degrades silently, and the bill arrives as churned customers and a fire drill, not as a line item. Optimize for cost per reliable, monitored model in production, not cost per hour. The cheapest hour and the cheapest outcome are almost never the same person.

MLOps engineer vs an AI engineer or ML engineer: which do you actually need

This is the question that saves the most money when you get it right and wastes the most months when you get it wrong. The titles overlap and companies use them loosely, but the center of gravity is different for each. An ML engineer leans toward building and training models: feature pipelines, model architecture, fine-tuning. An AI engineer leans toward composing existing models into a working product feature. An MLOps engineer leans toward the operational layer that keeps either of those reliable in production.

The practical decision rule is about where your pain is. If your pain is "we cannot get a good enough model," you need an ML or AI engineer. If your pain is "we have models that work but they keep breaking, drifting, or costing too much in production," you need MLOps. Hiring an MLOps specialist when your real problem is model quality is like hiring a pit crew when you do not have a car yet. The reverse, asking an applied AI engineer to own a production ML platform, is how reliability quietly becomes nobody's job.

The honest answer for many early teams is that you need the AI or ML engineering first and the MLOps shortly after, often in the same person at small scale and split into specialists as volume grows. I walk through the full role taxonomy and the interview questions for each in the hiring cluster, because matching the specialist to the problem is the decision that pays back the most in this whole space.

Three ways MLOps hires fail (and how to avoid them)

I will keep these illustrative and NDA-safe, but the patterns are real and I have watched each of them play out more than once.

The tool-lister. A team hired an engineer whose resume listed every MLOps platform in existence, and he could stand up infrastructure beautifully. But when their recommendation model started degrading, he had no monitoring connecting model behavior to business outcomes, because he had built dashboards for system metrics, not model quality. The model drifted for weeks before anyone noticed conversion sliding. The fix was not more tools; it was the failure-mode thinking the interview never tested for.

The premature hire. A founder hired a senior MLOps engineer before the team had a single model in production. For four months the engineer built an elaborate platform for models that did not exist yet, the team burned a senior salary on infrastructure speculation, and the actual product work, getting a model good enough to ship, stalled because the wrong specialist was in the seat. They needed an AI engineer first.

The unowned monitoring. A team split MLOps across three people, each owning a slice, and monitoring fell into the gap between them. Everyone assumed someone else was watching for drift. When a data pipeline upstream changed format, the model started serving garbage, and the alert that should have caught it had never been built because it was nobody's explicit job. Reliability has to be owned by a named person, not distributed into a gap.

Each of these is avoidable with the screening rubric above and an honest read on whether you have the in-house ability to vet. When you do not, the lower-risk move is to engage a team that has already absorbed these lessons. That is the argument for working with a pre-vetted MLOps engineer rather than running the gauntlet yourself, especially for your first hire in this function.

Frequently asked questions

What does an MLOps engineer do, in one sentence?

An MLOps engineer owns the path a model takes from a trained artifact to a reliable production service, and everything that keeps it healthy afterward: deployment, CI/CD, monitoring, drift detection, rollback, and inference cost. The defining word is reliable. They are the reason a model that worked in a demo keeps working in production without a human standing behind it.

How much does it cost to hire an MLOps engineer in 2026?

In the US, base salaries run roughly $90K to $257K depending on seniority, with a national average around $130K to $165K and senior specialists at top labs exceeding $300K total comp. But the loaded cost, including benefits, ramp, and the risk of a bad hire, is far higher than the salary line, so budget for the outcome, not the rate card.

Do I need an MLOps engineer or an AI/ML engineer?

If your problem is getting a good enough model, hire an AI or ML engineer. If your problem is that working models keep breaking, drifting, or costing too much in production, hire MLOps. Early teams often need the model-building role first and the MLOps role shortly after; at small scale one strong generalist can cover both before you split into specialists.

How do I vet an MLOps engineer if I do not have an MLOps background myself?

Run the failure-mode questions: ask them to walk through a deploy, find a silent drift, and answer which data produced the model serving traffic now. Strong candidates lead with rollback, monitoring, and lineage; weak ones describe a happy path. If you genuinely cannot tell a real answer from a confident one, buy the capability pre-vetted instead of gambling on a search you cannot evaluate.

If you want the full picture on building the team around this hire, my book Building an AI-Native Team covers the role mix end to end, and the pillar guide to hiring AI engineers connects it to the rest of the cluster. And if you would rather have reliability owned from day one without the hiring risk, that is exactly what Devlyn's MLOps engineers are for. Hire for the bad day. The good day takes care of itself.

How to Hire a RAG Engineer Who Survives Production

Alpesh Nakrani — Wed, 15 Apr 2026 18:30:00 GMT

Most RAG engineers can demo retrieval. Few can keep recall from collapsing in production. Here is how to hire the second kind, what they own, and what it costs.

When you hire a RAG engineer, you are not hiring someone to wire up a vector database and a chat box. You are hiring the one person on your team who owns whether the system retrieves the right evidence before the model ever opens its mouth. That is the job. Everything else, the framework names, the embedding model, the orchestration library, is downstream of it, and most candidates who interview well have it backwards.

I have hired and deployed more than 80 senior AI engineers at Devlyn and shipped over 200 products on top of them. A large share were retrieval systems, and the single most expensive hiring mistake in this role is hiring someone who can build a demo on a clean corpus, then discovering in month three that they have no idea why recall is collapsing on real traffic. The demo is the easy 80%; the 20% that keeps a RAG system working after contact with a messy corpus and real users is the entire reason the role exists. This piece is about how to tell the two apart before you sign an offer, and if you want the broader role first, start with my definitive guide to hiring AI engineers.

Key takeaway: A RAG engineer owns retrieval quality, not the demo. The job is keeping the right evidence in front of the model on real, drifting traffic, not making a clean corpus look good in a sprint review.
Screen for the recall-collapse instinct. Anyone can stand up a vector search. The hire you want is the one who reaches for an eval harness and a recall number before they touch the prompt.
Chunking, embeddings, retrieval, and evaluation are the four surfaces. A strong candidate can reason about trade-offs in each and tell you which one is failing from a symptom.
Retrieval is the lever, not the prompt. When answers are weak, the fix is almost always in what got retrieved, and a strong candidate knows that without being led to it.
Cost varies more by sourcing model than by seniority. A senior in-house hire, staff augmentation, and a scoped pod are different price-and-speed trade-offs, not different quality tiers.

What a RAG engineer actually owns

The clearest way to scope this role is by the surfaces a retrieval engineer is accountable for. There are four, and a candidate who cannot speak fluently to all four is a generalist who has done some RAG, not a retrieval specialist.

Chunking. How the corpus gets split before it is embedded. This sounds trivial and is not. Chunking choice alone can swing retrieval recall by roughly 8 to 9 percentage points, and the default chunkers in popular libraries often underperform purpose-built strategies (Chroma's chunking research). A strong RAG engineer treats chunking as a tunable decision tied to document structure and query patterns, not a setting they accept from a tutorial.

Embeddings. Which model turns text into vectors, and how that choice degrades at the extremes of chunk size and vocabulary drift. The hire you want knows that an embedding model is not a permanent decision, that the corpus will move underneath it, and that re-embedding cadence is an operational question with a cost attached.

Retrieval. The actual lookup: dense vector search, sparse keyword matching like BM25, the hybrid of the two, and a reranker to reconcile them. Pure dense retrieval is a fine baseline that degrades sharply as the corpus grows. A retrieval engineer who only knows cosine similarity is a retrieval engineer who has never watched a system age.

Evaluation. The part everyone skips and the part that separates the role from a weekend project. Recall@k, context recall, faithfulness, a frozen golden set of query-chunk pairs run on a schedule. Without this, nobody can tell you whether retrieval is working, only whether the demo looked good. I go deeper on the measurement side in how to evaluate RAG.

The skills and signals that separate strong from weak

Resumes in this role are nearly useless, because the vocabulary is cheap. Everyone lists vector databases, LangChain, embeddings, RAG. None of those words tell you whether the person can keep a system alive. The signals that matter are about judgment under failure, and you have to dig for them.

The strongest signal is an eval-first instinct. Describe a RAG system giving weak answers and watch where the candidate goes. The weak candidate reaches for the prompt, suggests a bigger model, or proposes adding more context. The strong candidate asks what recall looks like, whether the right chunks are even in the retrieved set, and how you are measuring it. That reflex, to interrogate retrieval before generation, is the whole job in one reaction.

The second signal is comfort with the unglamorous operational layer. Parsing is part of retrieval; a PDF that extracts as garbage will never retrieve well no matter how good the embeddings are. Freshness, reindexing, deletion, permission-aware retrieval so users cannot pull documents they should not see, these are the parts that break in production and never show up in a demo. A candidate who lights up about parsing and reindexing has shipped real systems. A candidate who finds them beneath them has not.

The demo is the easy 80%. The 20% that keeps recall from collapsing after contact with a messy corpus is the entire reason the role exists.

The third signal is honesty about trade-offs. Hybrid retrieval is more robust but more expensive to tune; reranking improves quality but adds latency; re-embedding the corpus improves recall but costs money and engineering time. A strong candidate names the cost on the other side of every improvement, while a weak one talks about retrieval as if quality were free. If you want the framework I use to separate observable judgment from resume keywords across the whole market, it is in the AI engineer skills breakdown.

How to vet one: signal, test, and what good sounds like

Here is the screening matrix I actually use. Each row is a signal that matters, the test that surfaces it, and the difference between a strong and a weak answer. Run these in a working session against a real or realistic corpus, not as trivia.

Signal	How to test it	Strong answer	Weak answer
Eval-first instinct	"Answers are wrong. What do you check first?"	Looks at recall and whether the right chunks were retrieved before touching the prompt	Suggests a bigger model or a better prompt
Chunking judgment	"How would you chunk these mixed documents?"	Ties chunk strategy to document structure and query patterns; expects to measure it	Names a fixed token size from a tutorial and stops
Retrieval depth	"Dense retrieval is missing results. Now what?"	Reaches for hybrid (BM25 + dense) and a reranker, explains the trade-offs	Adds more dense results or raises the top-k blindly
Operational ownership	"The corpus changes weekly. What breaks?"	Talks reindexing, freshness, deletion, re-embedding cadence, permissions	Assumes the index is set-and-forget
Recall under drift	"Recall was 0.9, now it is 0.7. Diagnose it."	Walks corpus drift, embedding staleness, chunking mismatch, eval gaps methodically	Has never seen recall move and improvises

The bottom row is the one that matters most, and it is the hardest to fake. Someone who has operated a real RAG system has watched recall degrade and had to diagnose it under pressure. Someone who has only built demos has never seen the number move, because demos run on frozen corpora. That gap is visible in seconds once you ask the right question.

The failure mode to screen hardest for: they can demo RAG but cannot keep recall from collapsing

This is the section I would tattoo on a hiring manager's wrist if I could. The most common, most expensive RAG hiring failure is hiring someone who builds an impressive demo and cannot keep it working three months later. It is so common because the demo is genuinely the easy part now. Tooling has made standing up a retrieval pipeline a weekend exercise. The hard part is everything that happens after real users and a real corpus arrive.

Here is an illustrative composite, NDA-safe, of how it goes wrong. A team hires a sharp engineer who ships a RAG assistant in three weeks, and in the demo it answers internal-knowledge questions cleanly, so everyone is thrilled. Then the corpus grows, documents get edited and deleted, new query patterns show up, and answers quietly get worse, but nobody notices for weeks because there is no eval harness, only vibes. By the time a customer complains, recall has drifted from strong to mediocre, and the engineer's only move is to keep editing the prompt, because retrieval was never the thing they actually understood.

The fix is upstream, in the hire. The engineer who survives this builds the retrieval eval harness before shipping, not after. They put recall@5 on a dashboard next to latency and error rate. They expect the corpus to drift and they instrument for it. This is the central argument of my book on RAG that survives contact with production, and it is also exactly the problem Devlyn's retrieval engineers are hired to solve when an in-house demo has quietly stopped working.

Someone who has only built demos has never watched recall move, because demos run on frozen corpora. That gap is visible in seconds.

If you screen for nothing else, screen for this: has this person operated a retrieval system long enough to watch it degrade, and do they reach for measurement instead of the prompt when it does? The Anthropic team's own work shows how much retrieval engineering moves the needle, combining contextual embeddings, hybrid search, and reranking cut their top-20 retrieval failure rate by 67% (Anthropic's contextual retrieval research). That is the work your hire either does or does not know how to do. The prompt is rarely the lever. Retrieval almost always is.

Where to find a RAG engineer, and what it costs

There are three sourcing models, and they are different trade-offs of price, speed, and commitment, not different quality tiers. The mistake is treating cost as a quality signal. It is not.

In-house senior hire. A full-time senior retrieval engineer in the US market is a substantial commitment, and the loaded cost is well beyond the salary line once you add benefits, equity, and the months of sourcing in the tightest hiring market there is. This is right when retrieval is core to your product and will be for years. It is wrong when you need the system fixed this quarter, because hiring takes longer than the problem will wait. I break down the real loaded numbers in what an AI engineer actually costs.

Staff augmentation. A senior retrieval engineer embedded in your team on a monthly basis. Faster to start, no long-term liability, and you keep direction. The trade-off is that you own the management overhead and the architecture decisions stay yours. This fits when you have a clear plan and need senior hands to execute it.

A scoped pod. A small team that owns the retrieval problem end to end, architecture through eval harness through maintenance cadence. More expensive per month than a single contractor, far cheaper than a bad in-house hire who ships a demo and leaves you with a system nobody can diagnose. This fits when retrieval quality is urgent and you would rather buy the outcome than manage the inputs. It is the model behind Devlyn's retrieval engineering, and it exists precisely because the in-house version of this role takes too long to fill when production is already slipping.

When you actually need one, and when you do not

Not every team that thinks it needs a RAG engineer does. The honest filter is whether retrieval quality is on your critical path to revenue or trust. If a wrong answer costs you a customer, a deal, or a compliance problem, retrieval is core and you need someone who owns it. If you are prototyping and nobody is depending on the answers yet, you do not need a specialist, you need a working baseline and the discipline to instrument it before it matters.

There is also a build-versus-buy question hiding inside this. If your retrieval need is generic, document search over a clean corpus with forgiving accuracy requirements, an off-the-shelf tool may carry you further than a hire. The moment your corpus is messy, permission-sensitive, or high-stakes, generic tools stop being enough and the specialist earns their cost. The decision between owning the retrieval stack and reaching for a model's parametric knowledge is its own question; I cover the adjacent version in RAG versus fine-tuning.

One more honest case: sometimes the right move is neither hire nor tool but a short audit. A two-week retrieval audit by someone senior can tell you whether your problem is chunking, embeddings, retrieval, or evaluation, and that diagnosis is worth more than a year of guessing. It is also a low-commitment way to see how a candidate or a vendor actually thinks before you commit to a longer engagement.

The mistakes that cost the most

The first mistake is hiring for framework names. A resume full of vector databases and orchestration libraries tells you the person has read the docs, not that they can keep a system alive. The frameworks are learnable in a week. The judgment about why recall collapsed is not, and that is the thing you are actually paying for.

The second mistake is hiring someone who blames the prompt. When a RAG system gives weak answers, the reflexive fix is to rewrite the prompt or upgrade the model, because those are the visible knobs. But the evidence the model was handed is upstream of both, and if retrieval handed it the wrong chunks, no prompt will save it. A candidate who instinctively reaches for the prompt is telling you they do not understand where the failure lives.

The third mistake is hiring without an eval plan in place. If you cannot measure recall, you cannot tell whether your new hire is helping or quietly making things worse, and you will find out only when a customer does. Build the golden set and the recall dashboard as part of onboarding, not as a someday project. The cost of skipping it compounds, and I have watched that compounding turn a fine month-one system into a month-three liability, the pattern I unpack in why RAG breaks in month three.

The fourth mistake is optimizing the wrong thing. Teams obsess over which embedding model is two points better on a benchmark while ignoring that their chunking is naive and their corpus is stale, or they pour effort into latency tricks like semantic caching before retrieval quality is even stable. Get retrieval right first; the optimizations matter, but only on top of a system that retrieves the right evidence in the first place. For the architectural version, where the system decides how to query rather than firing one fixed lookup, see agentic retrieval.

Frequently asked questions

What does a RAG engineer do?

A RAG engineer owns whether a retrieval-augmented system finds the right evidence before the model generates an answer. Concretely, that means chunking the corpus, choosing and maintaining embeddings, designing retrieval (dense, sparse, hybrid, reranking), and building the evaluation harness that proves recall is holding up. The role is defined by retrieval quality on real traffic, not by the demo.

How do I tell a strong RAG engineer from a weak one in an interview?

Describe a RAG system giving weak answers and watch where they go first. A strong candidate interrogates retrieval, asking about recall and whether the right chunks were even retrieved, before touching the prompt or the model. A weak candidate reaches for a bigger model or a prompt rewrite. The reflex to measure retrieval before blaming generation is the single most reliable signal.

How much does it cost to hire a RAG engineer?

It depends more on the sourcing model than on seniority. A full-time senior in-house hire carries a high loaded cost and a long time-to-fill in a tight market. Staff augmentation is faster and lower-commitment but leaves management and architecture to you. A scoped pod costs more per month than a single contractor but buys the outcome end to end, and is usually cheaper than a bad in-house hire who leaves you with an undiagnosable system.

Do I need a dedicated retrieval engineer, or can a generalist handle RAG?

If retrieval quality is on your critical path to revenue or trust, hire the specialist. A generalist can stand up a working baseline, but the operational layer, freshness, reindexing, permission-aware retrieval, and the discipline to keep recall from drifting, is where generic experience runs out. The messier and higher-stakes your corpus, the more a specialist earns their cost.

If you have a retrieval system that demos well and is quietly getting worse, or you are about to build one and want it instrumented from day one, that is exactly what Devlyn's retrieval engineers do. And if you want the full operating manual for the role first, RAG That Survives walks through the eval harness, the failure modes, and the maintenance cadence end to end. Hire for the month-three system, not the demo.

How to Hire an AI Agent Developer (and Vet One)

Alpesh Nakrani — Tue, 14 Apr 2026 18:30:00 GMT

Hire an AI agent developer who owns planning, tools, memory, evals, and guardrails, not someone who demos a flashy agent that dies in production.

To hire an AI agent developer, you have three honest paths: a vetted agency or studio that staffs the role for you, a specialist contractor sourced through a technical network, or a full-time hire you screen yourself. Whichever you pick, the single thing to screen for is the same: can this person take an agent from a demo that works once to a system that works on the thousandth messy input. That gap is where most agent projects die, and most of the market does not screen for it.

I have spent the last two years shipping production AI systems and, before that, sat in the CTO and COO seats where the broken systems landed on my desk. I have also been on the other side, staffing agent engineers at Devlyn for companies that came to us after a flashy proof-of-concept fell over in week one. Whether the job posting on your desk says hire an AI agent engineer, hire an agentic AI developer, or something else entirely, the role and the way you vet it are the same. This is the screening framework I use from both seats, written for the person who has to make the hire and live with it.

Key takeaway: An AI agent developer owns a system, not a prompt. Planning, tool contracts, memory, evals, and guardrails are the job, not nice-to-haves.
Screen for the demo-to-production gap. Anyone can wire a tool-calling loop that works in a demo. The skill is keeping it correct, safe, and debuggable on real, malformed, adversarial input.
Vet with a work sample, not a portfolio. Ask how they bound blast radius, what they log, and how they know an agent regressed. Strong answers are specific; weak ones are about model choice.
Match the sourcing model to the risk. A bounded pilot suits an agency or contractor; a long-lived core workflow eventually wants ownership in-house or a dedicated pod.
You may not need one yet. If your task is a single LLM call with a known output, you need a good engineer, not an agent specialist.

What an AI agent developer actually owns

The title gets used loosely, so let me be precise about what the role covers when it is done well. An AI agent developer builds systems where a language model is given a goal, a set of tools, and the ability to decide its own next action in a loop. That autonomy is the whole point, and it is also the whole risk. The job is not writing clever prompts. It is engineering the scaffolding that makes an autonomous loop trustworthy enough to put in front of real work.

In practice, a strong AI agent developer owns five things. They own planning and task decomposition: turning a vague goal into a sequence of bounded sub-tasks the model can actually execute and you can actually verify. They own tool contracts: the function definitions, schemas, and permissions that let the agent act on the world, scoped to least privilege. They own memory: deciding what the agent remembers across steps and sessions, and where that state lives. They own evaluation: the test harness that tells you whether a change made the agent better or quietly worse. And they own guardrails: the approval gates, blast-radius limits, and recovery paths for when the agent does something surprising, because it will.

If a candidate talks only about which model they would use and which framework they prefer, they are describing the easy 10% of the work. The hard 90% is the system around the model. I have written about the engineering reality of this in an honest accounting of what agents can do today, and the short version is that the value lives in a narrow band: tasks that are bounded, reversible, verifiable, and tool-scoped. A developer worth hiring knows that band cold and designs to stay inside it.

If you want help scoping that work or hiring for it, this is exactly what my team builds. Devlyn staffs agentic workflow engineers who own the whole system, not just the prompt.

The skills and signals that separate real agent developers from demo builders

Here is the uncomfortable truth that drives almost everything in this article: building an agent that works once is easy now. The frameworks are good, the models are capable, and a competent engineer can stand up a tool-calling loop that nails a curated demo in an afternoon. That demo tells you almost nothing about whether the person can ship.

The signals that actually matter are about reliability under mess. A real agent developer thinks in failure modes first. They expect the model to hallucinate a tool output and proceed as if it were real. They expect the agent to confidently pursue the wrong goal, get stuck in a loop, or report success on an action that actually failed. So they ask, before writing the happy path, what happens when each step goes wrong and how the system contains it.

The second signal is eval literacy. Ask how they would know if a prompt change made the agent worse. A weak answer is "I would test it and it seemed fine." A strong answer involves a held-out set of representative inputs, labeled outputs, failure modes categorized by severity, and a check that runs every time something changes. Vibes are not evals. If you want to go deep on what good evaluation looks like, my piece on agent evals lays out the harness; a candidate who already thinks this way is rare and worth paying for.

The third signal is operational instinct. They reach for least-privilege tool scopes without being asked. A summarization agent should not have write access to the document store. A drafting agent should not hold a send key. They talk about observability, structured traces of every reasoning step and tool call, because they have been on the wrong end of an agent that did something weird and taken too long to find out why. These instincts come from having operated a system in production, not from having built one in a notebook.

Building an agent that works once is easy now. The skill you are paying for is keeping it correct, safe, and debuggable on the thousandth messy input.

How to vet an AI agent developer: signal, test, strong vs weak

Resumes and portfolios are noise for this role. Everyone has a demo. The only reliable signal is how a candidate reasons through a real design problem, ideally a small paid work sample on a sanitized version of your actual task. Below is the screening table I use. Run each signal as a question, listen for the shape of the answer, and weight the failure-mode and eval rows heaviest.

Signal	How to test it	Strong answer	Weak answer
Failure-mode thinking	"Walk me through what your agent does when a tool call returns garbage."	Validates outputs, detects low confidence, has a fallback and a human-escalation path.	"It usually does not happen" or only describes the happy path.
Eval discipline	"How would you know a prompt change made it worse?"	Held-out labeled set, errors by failure mode and severity, a check that runs on every change.	"I would try it a few times and see how it feels."
Blast-radius control	"This agent can touch our database. How do you scope it?"	Least privilege, read replicas, approval gates on irreversible actions, audit logs.	"It only does what the prompt tells it to."
Memory design	"How does the agent remember something from a run three days ago?"	Distinguishes working, episodic, and semantic memory; names where state lives and how it is retrieved.	"We just put the history in the context window."
Observability	"The agent did something wrong. How fast can you tell me why?"	Traces of reasoning and tool calls, cost and latency by step, sampled human review.	"We have logs of what it did" with no reasoning trace.
Scope honesty	"What would you refuse to let this agent do autonomously?"	Names irreversible financial, legal, or relational actions as off-limits.	"With the right prompt it can handle anything."

One pattern is worth calling out. The best candidates volunteer the limits before you ask. They tell you what they would not automate. That instinct for where autonomy stops earning its keep is the clearest sign of someone who has shipped, not just demoed.

Where to find AI agent developers, and the trade-offs

There is no clean talent pool for this yet. "AI agent developer" is barely two years old as a title, and most people who can genuinely do the work came to it from adjacent roles, backend engineers who got obsessed with reliability, ML engineers who learned to ship, or full-stack builders who lived through a production incident. You are sourcing for a mindset more than a credential.

The realistic channels are three. Specialist agencies and studios give you vetted people fast and absorb the screening risk, which matters most when you cannot yet tell a strong answer from a weak one in an interview. Contractors and freelancers from technical networks are cheaper and good for a bounded pilot, but you carry the vetting burden and the continuity risk if they leave mid-build. Full-time hires give you ownership and institutional memory, but the search is slow and you are competing for a thin pool against companies paying top of market.

My honest guidance: match the channel to the lifespan of the work. A bounded, well-specified pilot, "build us a triage agent for this one queue", is a great fit for an agency or contractor who can prove value in weeks. A core workflow that will live for years and accumulate domain logic eventually wants an owner in-house, or a dedicated pod that hands over real documentation. The mistake is using a short-term channel for a long-lived system, or hiring a full-time specialist before you have a problem worth their time.

What it costs to hire an AI agent developer

Pricing tracks the broader AI engineering market, which runs hot. In the United States in 2026, Built In reports an average AI engineer base salary around $184,757 and average total compensation around $211,243, and agent specialists with real production track records sit at the top of that range or above. Senior people in San Francisco and New York can clear $300,000 in total comp once equity is included.

Contract and agency rates vary widely, and the headline number is the wrong thing to optimize. The real cost of an agent developer is not their rate; it is the cost of getting it wrong. An agent that takes an irreversible action, a wrong refund, a bad customer email sent at scale, a corrupted record that triggers billing, can cost more in one incident than a year of the premium you saved by hiring cheap. I have unpacked the full loaded picture for AI roles in what it really costs to hire an AI engineer; the agent version of that math just has a heavier tail, because the failure modes reach further into your operations.

The real cost of an agent developer is not their rate. It is the cost of one irreversible action taken by a system nobody built guardrails around.

When you actually need one (and when you don't)

Not every AI feature needs an agent, and not every agent needs a specialist. If your task is a single LLM call with a predictable input and a structured output, classify this ticket, summarize this document, extract these fields, you do not need an agent developer. You need a solid engineer who understands the model as one more API. Adding a multi-step autonomous loop to a problem that does not require one just adds failure modes you now have to manage.

You genuinely need an AI agent developer when the task requires the model to take multiple dependent steps, choose its own actions from a set of tools, and adapt based on intermediate results, and when getting that loop wrong has real consequences. Multi-system back-office automation, support resolution that touches several tools, research workflows that gather and synthesize across sources: these are agent-shaped, and they reward someone who has built the scaffolding before. If you are still deciding whether your problem is agent-shaped at all, my guide to building AI agents and the broader piece on agentic workflows will help you draw the line before you spend on the hire.

This also connects up to the bigger staffing question. An agent developer is one role on an AI-native team, and how it fits with your other hires matters. The definitive guide to hiring AI engineers sets the broader context, and the skills breakdown helps you tell a generalist from a specialist before you write the job description.

The mistakes that burn hirers

The mistake I see most, by a wide margin, is hiring on the demo. A candidate or vendor shows a polished agent doing something impressive on stage, the room gets excited, and the contract gets signed. Then the same agent meets real input, a malformed PDF, a customer phrasing nobody anticipated, an API that times out, and it falls over quietly. Demos are curated; production is not. The gap between demo performance and production reliability for agents is the largest I have seen for any category of software, and it is precisely the gap a good developer is paid to close. Screen for it, or you will pay for it.

A second mistake is skipping evals and discovering regressions in front of customers. One company came to us after a support agent that "tested fine" started sending confidently wrong policy answers at scale after a routine model update changed a behavior nobody had pinned down with a test. There was no eval suite to catch it and no trace to explain it. The fix was not a better model; it was the evaluation and observability discipline that should have been there from day one. The names and details are changed, but the shape of that story repeats constantly.

A third mistake is treating human escalation as an afterthought. Teams ship an agent with no real handoff path, then act surprised when the edge cases, which are exactly the cases that matter, have nowhere to go. A developer worth hiring designs the human-in-the-loop path as a first-class feature, not a failure mode. If you want the deeper framework for building agents that hold up, my book Agents That Actually Work walks through the principles, and the memory systems piece covers the persistence layer where so many of these projects quietly break.

If you would rather not learn these lessons on your own production traffic, that is the work my team does. Devlyn staffs agent engineers who build the guardrails, evals, and escalation paths in from the start, and we scope a bounded proof point before you commit to anything bigger.

Frequently asked questions

What does an AI agent developer do? They build systems where a language model is given a goal, a set of tools, and the ability to decide its own next action in a loop. The job is the scaffolding around the model, planning and task decomposition, tool contracts scoped to least privilege, memory across steps and sessions, an evaluation harness, and guardrails for when the agent does something surprising. Writing prompts is a small part of it.

How do I vet an AI agent developer? Use a small paid work sample on a sanitized version of your real task, not a portfolio. Ask how they handle a tool call that returns garbage, how they would know a change made the agent worse, and how they scope an agent's access to your systems. Strong answers are specific about failure modes, evals, and blast radius; weak answers are about which model or framework they prefer.

How much does it cost to hire an AI agent developer? It tracks the AI engineering market, which is hot, average US AI engineer total compensation is reported around $211,000 in 2026, and agent specialists sit at the top of that range or above. But the rate is the wrong number to optimize. The real cost is the price of one irreversible action taken by a system nobody built guardrails around, which can dwarf any savings from hiring cheap.

Do I need an AI agent developer or just an engineer? If your task is a single LLM call with a predictable, structured output, classify, summarize, extract, you need a good engineer, not an agent specialist. You need an agent developer when the task requires multiple dependent steps, autonomous tool choice, and adaptation to intermediate results, and when getting that loop wrong has real operational consequences.

One more honest note worth its own line. The industry is littered with abandoned agent projects for a reason: Gartner has predicted that more than 40% of agentic AI projects will be canceled by the end of 2027, citing escalating costs, unclear business value, and inadequate risk controls. The developers who avoid that statistic are the ones who treat reliability, evaluation, and guardrails as the job, not the afterthought. Hire for that, and if you want a team that already works this way, that is what we do at Devlyn.

How to Hire a Generative AI Engineer (What to Screen For)

Alpesh Nakrani — Mon, 13 Apr 2026 18:30:00 GMT

How and where to hire a generative AI engineer, the production signals to screen for, what it costs, and when to hire through a partner instead.

To hire a generative AI engineer who will actually ship, screen for someone who can turn language, image, and multimodal models into a reliable product feature, scored against evals and a cost ceiling, not a developer who can call an API and demo it once. Source them through specialist communities or a partner that pre-vets for production experience, because the open market for this skill set runs four to five months. The fastest path when you cannot vet the candidate yourself is to hire through a partner who can put a pre-vetted senior engineer in front of you in days.

I have sat on both sides of this table. I started as an engineer, and I now run revenue at Devlyn, where I hire and deploy generative AI engineers into products that touch paying customers in physical stores. So I will skip the recruiter platitudes and tell you what separates an engineer who turns a flashy demo into a margin-positive feature from one who burns six months and a quarter-million dollars on something nobody trusts. This is the generative-AI deep dive under my broader guide to hiring AI engineers.

Key takeaway: A generative AI engineer is an applied-systems hire, not a research hire. Screen for production judgment across text, image, and multimodal models, eval discipline, and cost control, not model trivia or benchmark scores.
The interview should contain the actual job. If your loop is a coding puzzle and a culture chat, you are screening for the wrong role. Give them a generative output that is fluent but wrong and watch how they reason about why.
Cost tracks scarcity, not hype. Senior generative-AI specialists run roughly $240K-$350K+ base in the US, against a demand-to-supply ratio near 3.2 to 1, which is why open-market time-to-hire is months.
The build-vs-partner decision hinges on one question: can you vet this person yourself? If you cannot, hiring through a pre-vetting partner is faster and cheaper than a wrong full-time hire.
The costliest mistake is hiring the resume instead of the failure mode you cannot tolerate. Define the job by what must not break, then hire against that.

What a generative AI engineer actually owns

A generative AI engineer builds reliable product features on top of generative models, the models that produce text, images, audio, and increasingly all three at once. That is the whole job, and the word doing the work is "reliable." Calling the model was never the hard part; any competent developer can get a model to produce something. The hard part is making it produce the right thing, fast enough, and cheaply enough, on the long tail of inputs real users send, every single time.

The surface area is wider than people expect. On the text side it is large language models: retrieval pipelines, prompting, tool calling, structured outputs that downstream code can trust. On the image side it is diffusion models, where the unglamorous work is keeping generated and edited outputs on-brand and safe. On the multimodal side it is models that take an image and a question and return an answer, exactly the kind of feature we ship at Devlyn when a customer points a camera at their face and asks which frames suit them, and a generative AI engineer threads all of it into something that holds up in front of a paying customer.

None of that shows up on a benchmark leaderboard. All of it shows up in your support queue when it is done badly: an image generator that drifts off-brand, a chatbot that invents a return policy, a multimodal feature that confidently misreads the photo. The engineering that prevents those failures is the system around the model, not the model itself.

I have learned to distrust candidates who lead with which models they have used. The model is the least durable part of the stack; it will be swapped twice before the feature is a year old. The durable skill is the system thinking around it, the evals, the guardrails, the cost controls, the graceful failure paths. If you want the broader taxonomy of the role, I wrote it up in what an AI engineer is and the skills that matter.

The skills and signals to screen for

The skill that predicts success in this role better than any other is evals-first thinking. A generative AI engineer who reaches for an evaluation set before they reach for a bigger model has internalized the only discipline that makes generative work tractable. If they cannot tell you how they would measure whether the feature is good, they cannot build a feature that is good, no matter how polished the demo looks. Generative output is subjective enough that without a measurement protocol, "better" is just a feeling.

The second signal is failure-mode literacy across modalities. Ask what breaks in a RAG-backed assistant and a strong candidate will not say "hallucination" and stop. They will walk you through retrieval missing the relevant chunk, the model ignoring the context it was given, and stale embeddings, then tell you how they would isolate which one is firing; ask the same about an image feature and they will talk about prompt adherence, safety filtering, and output consistency. That diagnostic instinct is the difference between someone who debugs and someone who reruns the prompt and hopes.

The third signal is cost and latency awareness as a product concern, not an afterthought. Generative models are expensive to run and slow to respond, and a senior engineer knows a feature that is marginally better but 600 milliseconds slower at the 95th percentile can lose more revenue than it earns. They think about caching, routing easy requests to smaller models, and what a single resolved interaction actually costs, because they have shipped something that had to pay for itself. Even knowing when prompt caching earns its keep tells you whether someone has ever felt the bill.

The fourth signal is simply that they ship. Plenty of people can talk about diffusion and multimodal beautifully and have never put a generative feature in front of a user who could leave a bad review. Production changes how someone thinks, because production is where you learn that the boring failures, a malformed output at 2 a.m. that crashes the parser, are the ones that actually hurt. For the full screening playbook, see how to vet AI engineers and the interview questions I lean on.

The model is the least durable part of the stack. Hire for the system thinking around it, not the model name on the resume.

A signal-by-signal screening table you can run

Here is how I turn those signals into an interview. For each one there is something concrete to test and a clear tell that separates a strong answer from a weak one. Paste this into your hiring doc and run it.

Signal	What to test	Strong vs weak
Evals-first thinking	Give a vague feature ("generate product descriptions"); ask how they would know it works	Strong: defines a frozen, production-sampled set and failure modes first. Weak: jumps to model choice or "we would eyeball it."
Retrieval and grounding	Show an answer that is fluent but wrong; ask what they check	Strong: isolates retrieval miss vs context-ignored vs stale index. Weak: blames "hallucination" and swaps the model.
Multimodal and diffusion literacy	Ask how they would keep an image or vision feature on-brand and safe	Strong: prompt adherence, safety filtering, output evals, human review on the tail. Weak: "the model handles that."
Cost and latency judgment	Ask how they would cut inference cost 50% without hurting quality	Strong: caching, routing, task narrowing, smaller models on the easy tail. Weak: "use a cheaper model everywhere."
Structured output and tool use	Ask how they guarantee downstream code can trust the model output	Strong: schema validation, retries, guardrails, graceful failure. Weak: assumes the model returns clean JSON.
Production scar tissue	"Tell me about a generative feature that broke in production"	Strong: a specific boring failure and the fix that stuck. Weak: only demo or benchmark stories.

The pattern across every row is the same. A strong generative AI engineer treats the model as a replaceable input to a system they own; a weak one treats the model as the system. You are hiring for the first kind.

Generative AI engineer vs AI engineer vs LLM engineer

The titles overlap enough to cause real hiring mistakes, so let me draw the lines. "AI engineer" is the broad umbrella. It can include training-and-modeling work that sits closer to data science, the kind of role I separate out in AI engineer vs ML engineer. A generative AI engineer is the specialist within that umbrella who works with generative models specifically, text, image, audio, multimodal, and threads them into product features.

An LLM engineer is the narrower cousin focused on language models in particular: retrieval, prompting, tool use, and the eval loop around them. In practice the generative-AI title is the wider net. If your product is purely a text assistant, an LLM engineer is the precise hire. If you are generating images, building a multimodal feature, or expect to span several of those, the generative AI engineer is the role you are actually hiring for.

The reason this matters at hiring time is calibration. Write a job description for a generative AI engineer when you only need text work and you will overpay and over-screen. Write one for an LLM engineer when you need image and multimodal capability and you will hire someone who is genuinely strong but missing half the surface area you need. Match the title to the modalities your product touches, not to whichever phrase is trending.

Where to find and vet generative AI engineers

The supply problem is real, so where you look matters. The strongest applied generative AI engineers are rarely scanning general job boards; they are employed, building, and reachable through specialist communities, open-source contributions to eval and generation tooling, technical writing, and referrals from people who have shipped with them. A candidate who has published a thoughtful post-mortem on a generative feature going wrong is worth ten who list "generative AI" as a skill.

Wherever you source them, the vetting bar is the same, and it is not a coding-puzzle loop. Algorithmic trivia tells you nothing about whether someone can debug a grounding failure or design an eval for subjective output. The single highest-signal screen is a small, paid take-home built around a realistic failure: here is a generation pipeline that returns plausible-but-wrong results, find out why and propose a fix. How they reason through that tells you more than any whiteboard round.

I watched a team nearly pass on a quiet candidate who fumbled the systems-design trivia, then ace the take-home by writing an eval harness before touching the prompt and catching that the retrieval index was chunking mid-sentence. They hired her. She turned out to be the best generative AI engineer on the team, precisely because her instinct was to measure before she guessed. The trivia round would have screened her out; the work-shaped exercise screened her in.

The mirror-image story is the candidate who dazzled in the interview, name-dropped every model and framework, and shipped an image feature that drifted off-brand on real traffic because he had never built a single eval against production-sampled inputs. Both stories are composites, but the lesson is not: vet for the discipline, not the vocabulary. The discipline of measuring before guessing is the whole job, and it is exactly what a real evaluation loop forces.

What it costs to hire a generative AI engineer

Compensation for this role is high because the talent is genuinely scarce, not because of hype. As of 2026, senior AI engineer base salaries in the US run roughly $180K-$280K, and generative-AI specialists command a premium on top of that, landing around $240K-$350K+ at the senior level according to the kore1 AI engineer salary guide. That premium is the market pricing the gap between a general engineer and one who can make a generative model behave in production. I break the full picture down in what an AI engineer costs.

The scarcity behind those numbers is structural. Across the market there are roughly 3.2 open AI roles for every qualified candidate, and NLP and generative-model specialists are rated among the most acute shortages, per secondtalent's global AI talent shortage data. That same data puts the global average time-to-hire for these roles near 4.7 months. If you are planning a roadmap around a hire you have not started, that lead time is the number that should worry you.

The cost that gets ignored is the cost of getting it wrong. A failed senior technical hire is commonly estimated at 1.5x to 3x annual salary once you count ramp time, severance, the opportunity cost of the unbuilt roadmap, and the rehire. For a $250K generative-AI role, that is a $375K to $750K mistake, and it is far more likely when you cannot evaluate the person you are hiring. The expensive part of hiring is not the salary; it is the wrong salary.

In-house vs hiring through a partner

The build-vs-partner decision is not about cost first; it is about your ability to vet and the time you have. Hiring a full-time generative AI engineer into your own org is the right move when generative work is core and recurring, when you can credibly evaluate the candidate, and when you can afford to wait months to fill the seat. If all three are true, hire in-house and own the capability. I lay out that trade in detail in in-house vs outsourced AI and when to hire at all.

The case for hiring through a partner gets strong the moment one of those conditions fails. If you cannot confidently vet a generative AI engineer yourself, you are making a $250K-plus bet on a skill set you cannot assess, and a partner who has already done the vetting absorbs that risk. If you need someone shipping in weeks rather than months, a pre-vetting partner skips the four-to-five-month open-market search. And if the work is real but not yet a permanent headcount, an embedded specialist lets you move now without committing to a hire you might not need in a year.

This is the gap Devlyn is built to close. If you would rather not run a five-month search and a vetting loop you are not equipped to run, Devlyn can put a pre-vetted senior engineer in front of you in days, screened for exactly the signals in the table above: retrieval, grounding, multimodal and diffusion work, structured outputs, evals, and cost controls. You keep the option to convert to full-time once you have seen the work, which is a far safer way to make a senior hire than a resume and three interviews.

The honest version of this advice is that a partner is not always the answer. If generative AI is your core product surface for the next five years and you have the judgment to hire well, building the team yourself is the better long-term play. The partner route wins on speed, vetting risk, and optionality, which is exactly what most teams making their first generative-AI hire are short on.

The common mistakes hiring for this role

The mistake I see most often is hiring the resume instead of the failure mode. Teams write a job description that lists every fashionable model and acronym and then interview for keyword coverage, when they should start from the question "what must this feature never get wrong?" and hire the person whose instincts are organized around preventing exactly that. Define the job by the failure you cannot tolerate, and the screening writes itself. I collected the rest of these traps in the AI hiring mistakes I see teams repeat.

The second mistake is an interview loop with no eval in it. If your process is two algorithm rounds and a behavioral chat, you have measured general engineering and culture and learned nothing about whether this person can make a generative model reliable. The interview has to contain the actual job, which means a grounding failure to debug or an eval to design, scored on reasoning rather than a clean answer.

Frequently asked questions

How do I hire a generative AI engineer if I cannot evaluate the skills myself?

Hire through a partner that pre-vets for production generative-AI experience, or bring in a trusted senior practitioner to run your technical screen. Making a $250K bet on a skill set you cannot assess is the single most expensive way to hire, and a pre-vetting partner exists precisely to absorb that risk. You can convert a strong embedded engineer to full-time once you have seen real work, which beats hiring on a resume and three interviews.

What is the difference between a generative AI engineer and an AI engineer?

"AI engineer" is the broad umbrella that can also include training-and-modeling work closer to data science. A generative AI engineer is the specialist who works with generative models, text, image, audio, and multimodal, and threads them into product features. If your product is purely text, an LLM engineer is the more precise hire; if it spans image, vision, or multimodal, the generative AI engineer is the role you actually need.

How much does it cost to hire a generative AI engineer?

Senior US base salaries for generative-AI specialists run roughly $240K-$350K+ as of 2026, a premium over general engineering driven by a demand-to-supply ratio near 3.2 to 1. Embedded or partner engagements trade a monthly rate for speed and lower vetting risk. The bigger number to watch is the cost of a wrong hire, commonly 1.5x to 3x salary once you count ramp, opportunity cost, and rehire.

How long does it take to hire a generative AI engineer?

If you want the full hiring philosophy underneath this, roles, sequencing, and how to staff for judgment rather than throughput, it is in my book Building an AI-Native Team and the pillar guide to hiring AI engineers. And if you would rather skip the search entirely, Devlyn places pre-vetted senior engineers screened for everything in this article. Hire for the discipline. Ignore the demo.

How to Hire a Computer Vision Engineer: What to Look For

Alpesh Nakrani — Sun, 12 Apr 2026 18:30:00 GMT

How to hire a computer vision engineer who survives your real-world images: the skills and signals to screen for, where to find them, what it costs, and when you actually need one.

To hire a computer vision engineer who actually ships, screen for someone who treats messy real-world images, lighting, occlusion, and camera drift as the job rather than an edge case, and source them through a specialist network or a partner that pre-vets for production deployment instead of a general job board. If you cannot vet the candidate yourself, the fastest safe path is to hire a computer vision engineer through a partner who can put a pre-vetted senior in front of you in days, instead of the four-to-six months an open-market search for this scarce role usually takes.

I have sat on both sides of this table. I started as an engineer, and I now run revenue at Devlyn, where I hire and deploy computer vision engineers into products that watch real cameras, scan real documents, and make decisions a customer or an operator has to trust. So I will skip the recruiter platitudes and tell you what separates a CV engineer whose model holds up on your warehouse footage from one whose model scored beautifully on a benchmark and fell apart the first week it saw your actual lighting. This is the computer vision deep dive under my broader guide to hiring AI engineers.

Key takeaway: A computer vision engineer is a data-and-perception hire, not a generic ML hire. Screen for how they handle messy, real-world images, not which architectures they can name.
The benchmark trap is the whole game. A candidate who is "great on COCO" can still ship a model that fails on your cameras under bad lighting, occlusion, and drift. Test against dirty data, not clean leaderboards.
Annotation and label quality decide more outcomes than model choice. The best CV engineers obsess over the data pipeline and the labeling rubric, because that is where production accuracy is actually won or lost.
Deployment and edge constraints are part of the role, not a handoff. A model that needs a datacenter GPU is useless on a $200 camera at the store. Screen for latency, quantization, and on-device thinking.
Cost tracks scarcity, and the wrong hire costs far more than the right salary. Define the role by the failure you cannot tolerate, then hire against that, or hire through a partner who already has.

What a computer vision engineer actually owns

A computer vision engineer builds systems that turn pixels into decisions: detect the object, segment the defect, read the label, count the people, flag the unsafe behavior. That is the job, and the word doing the work is "systems." Anyone can fine-tune a detection model from a tutorial; the hard part is everything that determines whether that model survives the first month against real cameras, real documents, and real lighting that no benchmark dataset ever showed it.

Concretely, the role spans four layers, and a strong CV engineer owns all of them. The first is data and annotation: sourcing representative images, designing a labeling rubric that does not drift across annotators, and catching the class imbalance or mislabeled examples that quietly cap your accuracy. The second is modeling: choosing detection, segmentation, OCR, or video architectures appropriate to the constraint, and knowing when a smaller, faster model beats a heavier one. This is the same operator instinct I argued for in why the model you can operate beats the model that benchmarks best.

The third layer is evaluation, and it is where computer vision diverges from a vague "accuracy" conversation. A CV engineer should reach for the right metric for the task: mean Average Precision for detection, Intersection over Union for how well a predicted box or mask overlaps the ground truth, precision and recall split by class because a single number hides which mistakes you are actually making. IoU is just the overlap between prediction and truth divided by their union (the Jaccard index), and a candidate who cannot explain why 0.9 mAP can still mean a useless product on your hardest class has not shipped one that mattered.

The fourth layer is deployment, including the edge. A retail or industrial vision system frequently runs on a camera, a kiosk, or a small box on the factory floor, not a cloud GPU. That changes everything: latency budgets, quantization, memory ceilings, and the brutal fact that the model has to keep working when the network does not. A CV engineer who has only ever served models from a generous cloud endpoint will be surprised by the constraints that define real-world vision work.

Anyone can fine-tune a detector from a tutorial. The job is everything that decides whether it survives the first month against your real cameras.

The skill that separates production from benchmark

If you remember one thing from this piece, remember this: the gap between a computer vision engineer who looks impressive and one who ships is almost entirely about messy, real-world images. The market is large and getting larger, projected to grow from $19.78 billion in 2024 to $112.10 billion by 2035 at a 17.3% CAGR (MarketsandMarkets), and a lot of that money will be wasted on models that demoed well and never survived contact with production cameras.

Public benchmarks like COCO are clean. The images are well-lit, the objects are centered, the labels are careful, and the test distribution looks like the training distribution. Your data is none of those things. Your cameras have glare at 3pm and shadow at 6pm. Your objects are half-occluded behind a shelf or a forklift. Your lens slowly drifts out of calibration, your document scans are skewed and coffee-stained, and your "rare" class is the one that actually matters for the business. A model that scored 0.92 on a benchmark can drop to something embarrassing on your distribution, and the engineer who does not expect that has not done this before.

So the durable skill is not architecture knowledge. It is the instinct to ask, before writing a line of model code, what your images actually look like, where the lighting and occlusion live, how the camera will degrade over time, and which failures the business genuinely cannot tolerate. The strong candidate treats domain shift as the default condition, not a surprise. The weak candidate treats your messy data as a nuisance standing between them and the clean benchmark they would rather optimize.

This is the same discipline I keep coming back to across roles: the work that matters is the evaluation against your reality, not the score on someone else's. I made the broader version of this case for language models in my guide to evaluation that predicts production, and it transfers directly. A CV engineer who builds an honest evaluation set from your own footage, sliced by lighting and angle and class, is worth more than one who can recite every detection paper from the last three years.

The signals to screen for, and how to test them

The signal that predicts success better than any other is data-and-failure literacy: does the candidate think about images the way production will hand them, or the way a dataset curator prepared them? When you describe your problem, a strong CV engineer immediately asks about lighting variation, camera placement, occlusion, class balance, and how the labels were made. A weak one asks which model they get to use.

The second signal is annotation judgment. Ask how they would build and audit a labeling pipeline for your task, and the strong candidate talks about inter-annotator agreement, edge-case rubrics, and spot-checking labels before trusting any score. They know that a model is a mirror of its labels, and that most accuracy problems are label problems wearing a model costume.

The third signal is deployment realism. Ask what changes when the model has to run at 30 frames per second on a small edge device instead of a cloud GPU, and listen for quantization, model size, latency budgets, and graceful degradation when connectivity drops. A candidate who has only served from the cloud will hand-wave this; a production engineer has felt the pain of a model that was accurate and far too slow to use.

The fourth signal is honest evaluation. The strongest CV engineers distrust their own headline number. They report precision and recall per class, they show you the failure cases, and they can tell you exactly which conditions break the model before you find out in production. If you want the broader screening playbook this fits inside, see how to vet AI engineers and the interview questions I lean on; the wider skill map is in the skills that actually separate the good ones.

A screening table you can run an interview from

Here is the same set of signals as a table you can hand to whoever runs your technical screen: the signal, a concrete test, and what a strong versus weak answer looks like.

Signal	How to test it	Strong answer	Weak answer
Real-world image instinct	Describe your task; watch what they ask first	Asks about lighting, occlusion, camera placement, class balance, label quality	Asks which model or framework they get to use
Annotation judgment	"How would you build and audit our labeling pipeline?"	Inter-annotator agreement, edge-case rubric, spot-checks before trusting scores	"We outsource labeling and train on it"
Honest evaluation	"You report 0.9 mAP. Why might the product still fail?"	Per-class precision/recall, hard-class failures, distribution mismatch	Treats one aggregate number as the verdict
Deployment and edge	"What changes serving at 30 FPS on a $200 device?"	Quantization, latency budget, memory ceiling, graceful offline degradation	Assumes a cloud GPU is always available
Domain-shift awareness	"Your model drops 15 points in production. First three checks?"	Train/serve distribution diff, label leakage, camera/lighting drift	"Retrain on more data" with no diagnosis

Run the loop with at least one task that contains real, imperfect images, ideally a small sample of your own. A take-home or live exercise on slightly broken data surfaces these instincts faster than any whiteboard round, because it forces the candidate to confront the exact conditions that decide whether your project ships.

Where to find and vet a computer vision engineer

Senior computer vision engineers are scarcer than general software engineers and scarcer than the average machine learning hire, because the role demands both deep modeling skill and hard-won deployment experience. General job boards will flood you with candidates who have done coursework and Kaggle competitions and very few who have owned a vision system through its messy production life. The signal-to-noise is poor, and the search is slow.

The build-versus-partner decision is not about cost first; it is about your ability to vet and the time you have. Hiring a full-time CV engineer into your own org is the right move when vision work is core and recurring, when you can credibly evaluate the candidate, and when you can afford to wait months to fill the seat. If all three are true, hire in-house and own the capability. I lay out that trade-off in detail in in-house versus outsourced AI and the companion guide to hiring an ML engineer, the data-and-modeling cousin of this role.

The case for hiring through a partner gets strong the moment one of those conditions fails. If you cannot confidently vet a CV engineer yourself, you are making a senior-salary bet on a skill set you cannot assess, and a partner who has already done the vetting absorbs that risk. If you need someone shipping in weeks rather than months, a pre-vetting partner skips the multi-month open-market search. And if the work is real but not yet a permanent headcount, an embedded specialist lets you move now without committing to a hire you might not need in a year.

This is the gap Devlyn is built to close. If you would rather not run a multi-month search and a vetting loop you are not equipped to run, Devlyn can put a pre-vetted senior computer vision engineer in front of you, screened for exactly the signals in the table above: real-world image instinct, annotation judgment, honest evaluation, and edge deployment. You keep the option to convert to full-time once you have seen the work, which is a far safer way to make a scarce senior hire than a resume and three interviews.

What it costs to hire a computer vision engineer

Cost tracks scarcity, not hype. In the US as of 2026, a mid-level computer vision engineer commonly lands in the range of $150K to $190K base, and a senior with real production deployment experience runs roughly $200K to $270K base, with total compensation higher once equity is counted and frontier-lab or autonomous-vehicle packages running higher still. Treat these as illustrative operator figures rather than quotes; published salary aggregators vary widely and the real number depends on location, domain, and how scarce the specific skill is. (These ranges are illustrative, not pulled from a single source.)

The bigger number to watch is not the salary; it is the cost of the wrong hire. A computer vision engineer who builds a model that benchmarks well and fails in production can burn two quarters and the budget for the cameras and labeling around it before anyone is sure the problem is the model and not the data. The commonly cited cost of a mis-hire runs 1.5x to 3x salary once you count ramp, opportunity cost, and the rehire, and for a specialist role you could not vet in the first place, the high end is the realistic one. For the full breakdown of comp and total cost of ownership, see what an AI engineer actually costs.

An embedded or partner engagement trades a monthly rate for speed and lower vetting risk, which is often the cheaper option once you price in the wrong-hire downside. The math is not "monthly rate versus salary"; it is "monthly rate versus the expected cost of a senior bet you are not equipped to make." For a first vision hire, that framing usually points the same direction.

When you actually need one, and when you don't

Not every vision problem needs a dedicated computer vision engineer, and the honest answer matters because the wrong hire is expensive. If your task is common and well-served by an off-the-shelf API, generic OCR on clean documents, standard object detection on typical scenes, face blurring, basic content moderation, you may not need to build anything custom at all. A capable applied engineer wiring up a vision API can carry you a long way before the marginal accuracy of a custom model justifies a specialist.

You need a dedicated computer vision engineer when the task is specific to your domain and the off-the-shelf options fail on your data: defect detection on your particular product line, shelf or inventory analytics in your store layout, document understanding on your messy forms, safety analytics on your camera angles, anything where the accuracy gap between generic and custom is the difference between a useful product and a toy. The tell is simple: if your problem only gets solved by understanding your images, you need someone who specializes in your images.

If you are still deciding whether this is the moment to add the role at all, the broader timing question is worth its own pass, and I worked through it in when to hire an AI engineer. The short version: hire when the vision capability is core to the product and recurring, not when it is a one-off experiment you could prototype on an API first.

The mistakes that sink a computer vision hire

The mistake I see most often is hiring the benchmark, not the failure mode. A candidate who can hold forth on the latest detection architecture and post strong COCO numbers but has never watched a model degrade against real lighting and occlusion will produce impressive demos and fragile products. Start from the question "what must this system never get wrong, and how would we know?" and hire the person whose instincts are organized around answering it, not around topping a leaderboard.

The second mistake is treating annotation as someone else's problem. A CV engineer who shrugs at label quality and assumes the data team will hand them clean ground truth has not internalized that the labels are the product. The strong hire owns the labeling rubric, audits it, and treats a suspicious accuracy jump as a possible labeling artifact before celebrating it.

The third mistake is ignoring deployment until the end. A model that hits target accuracy in a notebook but cannot run inside the latency and hardware budget of the actual device is not a deliverable; it is a research result. Hire someone who designs for the edge from day one, because retrofitting a heavy model onto a small device after the fact is where vision projects quietly die. For the broader pattern of hiring errors, including the ones that are not specific to vision, see the AI hiring mistakes I keep watching teams repeat.

I have seen a team spend a quarter on a defect-detection model that scored well in validation and then missed the defects that mattered, because the validation set was lit like a studio and the factory floor was not, an NDA-safe composite of a pattern I have watched more than once. I have also seen a strong CV engineer rescue a stalled project in weeks, not by training a better model, but by rebuilding the labeling rubric and the evaluation set so the team could finally see which conditions were breaking it. The model was never the bottleneck. The discipline around the data was.

Frequently asked questions

How do I hire a computer vision engineer if I cannot evaluate the skills myself?

Hire through a partner that pre-vets for production vision experience, or bring in a trusted senior practitioner to run your technical screen. Making a senior-salary bet on a skill set you cannot assess is the most expensive way to hire, and a pre-vetting partner exists precisely to absorb that risk. You can convert a strong embedded engineer to full-time once you have seen real work on your own data, which beats hiring on a resume and three interviews.

What is the difference between a computer vision engineer and a general ML engineer?

A computer vision engineer specializes in perception from images and video: detection, segmentation, OCR, video analytics, and the deployment constraints that come with cameras and edge devices. A general ML engineer works across data-and-modeling problems without that perception focus. For vision-specific work, especially anything with messy real-world imagery, you want the specialist; for broader tabular or modeling work, the general ML engineer is the right hire.

How much does it cost to hire a computer vision engineer?

In the US as of 2026, mid-level computer vision engineers commonly run roughly $150K to $190K base and seniors around $200K to $270K base, with total comp higher once equity is counted and frontier or autonomous-vehicle roles higher still. These are illustrative ranges, not quotes; the real number depends on location and how scarce the specific skill is. Embedded or partner engagements trade a monthly rate for speed and lower vetting risk, and the bigger cost to watch is the 1.5x-to-3x-salary hit from a wrong hire.

What is the single best screening signal for a computer vision engineer?

Whether they think about your images the way production will hand them rather than the way a benchmark prepared them. The strongest CV engineers ask about lighting, occlusion, camera drift, class balance, and label quality before they pick a model, and they report failures per class instead of hiding behind one aggregate score. A take-home on slightly imperfect, real-world images surfaces that instinct faster than any whiteboard round.

If you want the broader hiring playbook this fits inside, start with my guide to hiring AI engineers and the team-design thinking in Building an AI-Native Team. And if you would rather skip the multi-month search and the vetting loop you are not equipped to run, Devlyn can put a pre-vetted senior computer vision engineer in front of you, screened for the real-world image instinct and deployment discipline that actually predicts a vision system worth shipping. Hire for how they handle your messy images. Ignore the leaderboard.

How to Hire an NLP Engineer (and What to Look For)

Alpesh Nakrani — Sat, 11 Apr 2026 18:30:00 GMT

How and where to hire an NLP engineer, the signals to screen for, what it costs, and why the role still matters in the LLM era, from an operator who hires them.

To hire an NLP engineer who will actually move a number, screen for someone who can turn messy domain text into reliable structured output, debug a retrieval or extraction failure under load, and prove it works with evals, then source them through a specialist network or a partner that pre-vets for production experience rather than a general job board. The fastest path, if you cannot vet the candidate yourself, is to hire through a partner who can put a pre-vetted senior NLP engineer in front of you in days instead of the weeks or months the open market currently takes.

I have sat on both seats of this table. I started as an engineer, and I now run revenue at Devlyn, where I hire and deploy NLP and LLM engineers into products that touch paying customers in real stores. So I will skip the recruiter platitudes and tell you what separates an NLP engineer who turns a pile of customer text into a margin-positive feature from one who ships a clever notebook that never survives contact with production. This is the NLP-specialist deep dive under my broader guide to hiring AI engineers.

Key takeaway: NLP engineering did not die when LLMs arrived. The generic parts got commoditized; the domain-specific extraction, classification, retrieval, and eval work still need an owner who can make them reliable.
Screen for the production layer, not the model trivia. The hard part was never running a model. It is making text systems correct, fast, and cheap on the long tail of real inputs.
The interview must contain real text. Give the candidate a messy, ambiguous document and watch how they reason about labels, edge cases, and failure modes. A LeetCode round screens for the wrong job.
Cost tracks scarcity, not hype. NLP specialists run roughly $122K-$200K+ in the US depending on seniority, and AI talent demand outstrips supply about 3.2 to 1, which is why open-market time-to-hire stretches into months.
The most expensive mistake is hiring the resume instead of the failure mode you cannot tolerate. Define the job by what must not break, then hire against that.

What an NLP engineer actually owns now

An NLP engineer builds systems that turn human language into something a product can act on, reliably, at scale. That sentence has not changed in a decade; what changed is the toolkit. Five years ago, an NLP engineer hand-built tokenizers, trained classifiers from scratch, and tuned feature pipelines. Today they orchestrate models that do a lot of that out of the box, so the job moved up the stack but did not disappear.

Here is what the role actually owns in 2026, in the order I see it create or destroy value:

Text classification and intent routing. Deciding what a piece of text is, which bucket a support ticket belongs in, whether a review is a complaint or a question, what a customer actually wants. This is unglamorous and it is everywhere, and a good NLP engineer knows that a 92% classifier on a clean test set can still wreck a workflow if the wrong 8% is your highest-stakes category.

Entity and structured extraction. Pulling names, dates, amounts, SKUs, prescriptions, and clauses out of unstructured documents and into a schema your system can trust. This is where most domain NLP lives, and it is harder than a demo makes it look, because real documents are inconsistent, multilingual, and full of edge cases the happy-path prototype never saw.

Search and retrieval relevance. Making the right thing come back when a user, or a downstream model, asks for it. An NLP engineer who understands ranking, embeddings, and query understanding is the difference between a search box that helps and one that frustrates. This is also the layer that makes or breaks retrieval-augmented generation, which is why it overlaps with the LLM engineer role.

LLM fine-tuning and adaptation. When a general model is not good enough on your domain language, the NLP engineer is the person who decides whether to fine-tune, distill to a smaller model, or fix the prompt and retrieval instead, and who has the data sense to do it without overfitting to a tiny set.

Evaluation on domain language. The discipline that ties it all together. A serious NLP engineer treats evaluation as first-class work, with a frozen, production-sampled test set and error analysis by failure mode, not a vibe-check on ten examples.

NLP engineering did not die when LLMs arrived. The generic parts got commoditized; the domain-specific work still needs an owner who can make it reliable.

NLP engineer vs LLM engineer (and why the line still matters)

This is the question I get most from founders in 2026, usually phrased as some version of "doesn't ChatGPT do NLP now, so why am I hiring an NLP engineer?" It is a fair question and the answer is genuinely useful for scoping the role, so let me be precise rather than diplomatic.

An LLM engineer treats the model as a fixed, powerful component and builds reliable systems around it, prompting, retrieval, tool use, guardrails, cost control. An NLP engineer reaches one layer deeper into the language problem itself, the labels, the schema, the linguistic edge cases, the evaluation of meaning, and is comfortable when the answer is not "call a bigger model" but "your taxonomy is wrong" or "this document type needs its own extractor."

In practice the roles overlap heavily, and a strong senior often covers both, but the distinction matters for hiring because it tells you what to screen for. If your problem is "build a chat experience on top of a capable model," you probably want an LLM engineer. If your problem is "we have ten years of contracts, support logs, or clinical notes and we need to extract and classify them accurately enough to act on," you want someone with real NLP depth, because LLMs alone are confidently wrong on exactly that kind of domain-specific text. The honest version: hire for the failure you cannot tolerate, and let that pick the title.

The skills and signals that actually predict a good hire

I have interviewed and hired enough of these engineers to know that the resume signals everyone screens for are mostly noise. A pile of NLP coursework, a Kaggle ranking, the ability to recite transformer architecture, none of it predicts who ships a reliable text system. Here is what does.

Production judgment over model knowledge. The strongest NLP engineers talk about latency, cost, monitoring, and what happens when the model is wrong, before you prompt them to. The weak ones want to discuss model architecture and benchmark scores. The job is reliability engineering on language, and the candidate's instinct should point there.

Data realism. Ask how they would build a labeled dataset for your problem and listen for whether they understand that labeling is hard, that annotators disagree, and that the label schema is a product decision, not a clerical one. Someone who treats data as a solved input is going to overfit to a clean set and ship something that breaks on real traffic.

Debugging instinct on retrieval and extraction. When extraction misses a field or search returns the wrong document, can they reason from symptom to cause, is it the query, the chunking, the embedding, the prompt, the label, the data? This is the core daily work, and it is the single best thing to test directly.

Eval discipline. A candidate who reaches for "let me define how we measure this before I change anything" is worth more than one who jumps to a fix. Evaluation is what separates an engineer who improves the system from one who just moves the failures around. The skills that actually matter for AI engineers broadly apply here, with extra weight on the language and data layers.

The screening table: signal, test, strong vs weak

Here is how I turn those signals into an interview loop. For each signal, there is a concrete test and a clear tell for a strong versus a weak answer. Paste this into your hiring doc.

Signal	How to test it	Strong answer	Weak answer
Production judgment	Ask what happens when the model is wrong in front of a user	Talks fallback, monitoring, human handoff, cost of the error	Talks accuracy on a benchmark, no failure plan
Extraction debugging	Hand them a messy document where a field is mis-extracted	Isolates cause: chunking, schema, prompt, or data quality	Reaches straight for a bigger model or more prompt text
Data realism	Ask how they would label and version a dataset for your domain	Treats labels as a product decision, plans for disagreement	Assumes labels are obvious and the set is clean
Eval discipline	Ask how they would prove a change helped before shipping	Frozen, production-sampled set; error analysis by failure mode	Eyeballs a few examples and calls it improved
Domain fit	Use real text from your product in the exercise	Asks clarifying questions about edge cases and stakes	Applies a generic pipeline without questioning the domain

Notice that none of these tests are LeetCode and none of them reward trivia. They reward the judgment you are actually paying for. A candidate who has only ever worked on clean academic datasets will struggle with the messy-document exercise, and that is exactly the signal you want before you commit a salary to them.

Where to find and hire NLP engineers

The sourcing problem is real, and it is the part most teams underestimate when they set out to hire NLP engineers. A senior NLP engineer with genuine production experience is scarce, and the open market reflects it. Posting a job and waiting is the slowest path, because the people you want are rarely looking and the people applying are often the generic-pipeline candidates the title attracts.

There are three realistic channels. Specialist communities and referral networks get you the highest signal but the slowest throughput and the most vetting work on your side. Marketplaces like Toptal or Turing get you volume faster but push the vetting burden back onto you. A pre-vetting partner gets you a senior who has already been screened for production work, which is the fastest path when you cannot run a rigorous loop yourself.

That last point is the whole build-versus-partner decision, and it hinges on one honest question: can you vet this person yourself? If you have a senior NLP or AI engineer on staff who can run the messy-document exercise and read the answers, hire direct and take the time. If you do not, you are gambling a six-figure salary and months of runway on an interview loop that screens for the wrong things, and a wrong hire is far more expensive than a partner. We built Devlyn's NLP hiring around exactly that gap, putting a pre-vetted engineer in front of you in days, because the failure mode I watched teams hit over and over was not "we could not afford it," it was "we could not tell who was good until it was too late."

Whichever channel you use, vet against the failure mode, not the resume. Run the same screening table on every candidate, use your own text, and weight production judgment over pedigree. The groundedness and extraction failures that wreck domain NLP systems are the ones a good vetting loop is designed to surface before you hire, not after.

What it costs (and what a wrong hire costs more)

Let me give you real numbers, because cost is where hiring decisions actually get made and where vague advice helps no one. In the US, NLP engineer salaries cluster between roughly $122,000 and $200,000 depending on seniority and location, with the broad average sitting around the $122K-$150K band per public aggregators, and senior specialists in expensive markets running well above it (Coursera, aggregating Talent.com, ZipRecruiter, and Glassdoor figures). Treat those as directional, not gospel; comp moves fast and varies by stack and city.

The scarcity is the more important number. Across AI roles in 2026, demand outstrips supply by about 3.2 to 1, with on the order of 1.6 million open positions globally against roughly 518,000 qualified candidates (Second Talent). That ratio is why a direct open-market hire can take months even when your comp is competitive, and why the salary number on the offer is only part of the real cost.

Now run the math on a wrong hire, because that is the cost most teams ignore. A mis-hired NLP engineer does not just cost their salary; they cost the months before you realize the extraction system is unreliable, the customer trust burned while it shipped wrong answers, the rework when someone competent has to rebuild it, and the opportunity cost of the feature that did not launch. In my experience that compounds to several times the salary line, so the premium on a pre-vetted hire or a partner is cheap insurance and the math is not close.

A wrong NLP hire does not cost their salary. It costs the months before you realize the system is unreliable, plus the trust you burned shipping it.

The hiring mistakes that cost the most

I will close the strategy with the three mistakes I see most, because avoiding them is worth more than any sourcing tactic.

Hiring the resume instead of the failure mode. Teams screen for the most impressive background and end up with someone optimized for a different problem than theirs. Define the job by what must not break, your extraction accuracy on the high-stakes category, your search relevance on the queries that drive revenue, and hire against that specific thing.

Running an interview with no real text in it. If your loop is a coding puzzle and a culture chat, you are screening for a generic engineer, not an NLP engineer. The single highest-signal hour you can spend is handing a candidate a messy, real document from your domain and watching how they reason about it. Skip that and you are hiring on faith.

Treating evals as optional. The engineers who fail in production are almost always the ones who never built a way to know they were failing. A candidate who does not instinctively reach for measurement before changing the system will ship confident, untracked regressions. I wrote up the broader pattern of hiring AI engineers and the deeper team-building version lives in The AI-Native Team, which walks through scoping these roles by the failure modes you cannot tolerate.

A short illustrative example of how this plays out: I have watched a team hire a credentialed NLP researcher who built a beautiful classification model that scored 94% on their held-out set, then watched it route the wrong support tickets to the wrong queue for weeks because the 6% it missed was concentrated in the urgent category, and nobody had built an eval that broke errors out by stakes. The fix was not a better model. It was the eval discipline the hire never had. That is the gap a real interview is supposed to catch, and the cheapest place to catch it is before the offer.

Frequently asked questions

Do I still need to hire an NLP engineer now that LLMs exist?

Often yes, if your problem is domain language rather than general chat. LLMs commoditized generic text tasks, but they are confidently wrong on specialized extraction, classification, and retrieval over your specific documents, and someone has to own the labels, the schema, and the evaluation that makes those systems trustworthy. If your problem is "build on top of a capable model," an LLM engineer may be the better fit; if it is "make sense of our messy domain text reliably," you want NLP depth.

What should I look for when I hire an NLP engineer?

Production judgment first: latency, cost, monitoring, and a plan for when the model is wrong. Then data realism (treating labels as a product decision), debugging instinct on retrieval and extraction, and eval discipline (a frozen, production-sampled test set and error analysis by failure mode). Screen these with a real document from your domain, not a coding puzzle.

How much does it cost to hire an NLP engineer?

In the US, salaries cluster between roughly $122,000 and $200,000 depending on seniority and location, with senior specialists in expensive markets running higher. Because AI talent demand outstrips supply about 3.2 to 1, open-market time-to-hire often stretches into months, so factor the cost of the slow search and the much larger cost of a wrong hire into the decision, not just the salary line.

Should I hire an NLP engineer full-time or through a partner?

It depends on one question: can you vet the candidate yourself? If you have a senior who can run a rigorous, text-based screening loop, hire direct. If you cannot tell a strong NLP engineer from a generic one, a pre-vetting partner is faster and cheaper than risking a six-figure mis-hire, because the expensive failure is not the salary, it is the months lost before you discover the system is unreliable.

If you have an NLP-shaped problem and would rather have a pre-vetted senior engineer in front of you in days than spend months screening for signals you are not sure how to read, that is exactly what Devlyn's NLP engineer hiring is built for. Hire against the failure mode. Test with real text. Measure before you ship.

Hire a Prompt Engineer? When You Actually Need One

Alpesh Nakrani — Fri, 10 Apr 2026 18:30:00 GMT

Hire a prompt engineer only when the skill cannot live inside an AI engineer. Here is what the role really is in 2026, how to screen for it, and what it costs.

If you want to hire a prompt engineer, the honest first answer is that you can, the skill is real and worth paying for, but most teams do not need it as a standalone seat. The work that mattered in 2023, coaxing a model with a clever phrase, has become a small part of a larger job: writing prompts as versioned, tested, governed instructions that sit inside a real product. That job has a name on most teams, and the name is increasingly AI engineer or context engineer, not prompt engineer. I am going to make the case for when a dedicated hire still makes sense, and when you are better off folding the skill into someone who also ships the system around it.

I write this from both seats. I have hired and deployed senior AI engineers at Devlyn and shipped products on top of them, and I read the traces and the P&L on the same afternoon. That vantage point makes me skeptical of the prompt-engineer-as-rockstar framing that peaked a couple of years ago. The people who actually move production quality are not the ones with a thread of viral prompts; they are the ones who can tell you why an output is wrong, build the eval set that proves it, and version the fix so it does not regress next week.

So this is not a recruiter's pitch. It is a buyer's guide to a role whose definition is in motion, written for the founder or engineering leader who has typed "hire prompt engineer" into a search bar and now has to decide what they are actually buying. If you already know you want the skill done right and want to skip the screening gauntlet, you can hire a prompt engineer through Devlyn who treats prompts as versioned, tested, governed instructions. If you want to make the call yourself first, read on.

Key takeaways

If you read nothing else, these are the load-bearing claims:

The skill is real; the standalone title is fading. Prompt work in 2026 means versioned instructions, eval sets, and structured outputs, and that work is increasingly absorbed into AI and context engineering roles.
A dedicated prompt engineer makes sense in narrow conditions. High prompt surface area, a model-migration burden, or a team large enough to specialize. Below that bar, fold the skill into an AI engineer.
Screen for evals, not for eloquence. The signal that separates strong from weak is whether the candidate can prove a prompt got better, not argue that it did.
The expensive mistake is hiring the title instead of the skill. A prompt-whisperer with no measurement discipline will make your system feel better and get worse.

What "prompt engineer" actually means in 2026

The phrase carries a 2023 connotation that gets in the way: a clever person who knows the magic words. That version of the role barely exists anymore, because the magic-words advantage evaporated. Models got better at understanding plain intent, the obvious patterns got documented everywhere, and the gap between an expert prompt and a competent one narrowed to the point where it rarely decides anything in production.

What replaced it is closer to engineering than to copywriting. A prompt engineer worth hiring treats prompts as code: versioned in a repo, reviewed in pull requests, tested against a frozen eval set, and rolled out behind a flag. The deliverables are concrete:

Structured-output schemas that downstream code can rely on.
Tool-call prompts that an agent executes without drifting.
A regression suite that catches the day a model update silently breaks your extraction step.
Migration tests for when you move from one model to another and need to know what changed.

That last item is underrated and is where the role earns its keep. Model providers ship updates constantly, and a prompt tuned for one version can quietly degrade on the next. Someone has to own the evidence that your behavior survived the change. This is the same discipline I describe in my work on LLM evaluation, applied at the prompt layer: you cannot manage what you have not frozen and measured.

The magic-words advantage evaporated. What replaced it is closer to engineering than to copywriting.

So when you say you want to hire a prompt engineer, be precise about which version you mean. If you mean someone to generate snappy completions, you are hiring for a problem that mostly solved itself. If you mean someone to own prompts as a tested, governed part of the system, you are hiring for something real, and the next question is whether it deserves its own seat.

When a dedicated prompt engineer makes sense (and when to fold it in)

Here is the contrarian part, stated plainly: for most teams under a certain size, a standalone prompt engineer is the wrong hire. The skill is essential, but it lives more naturally inside an AI engineer who also builds the retrieval, the orchestration, and the evals around the prompt. Splitting the prompt off from the system that runs it creates a handoff seam, and seams are where production quality leaks.

A dedicated hire starts to make sense under specific conditions. The first is sheer prompt surface area: if you maintain dozens of prompts across many features, each with its own eval set and migration risk, the coordination cost justifies an owner. The second is a heavy model-migration burden, where you are constantly testing behavior across providers and versions and someone needs that as their full-time job. The third is simply team size, once your AI org is large enough that specialization beats generalists, a prompt-and-evals specialist can be a real lever.

Below those conditions, what you usually want is an AI engineer or an LLM engineer who treats prompting as one of several tools. The reason is that the hardest prompt problems are rarely prompt problems. An output that looks like a wording failure is often a retrieval failure, a context-assembly failure, or a missing eval. The person who can see across those layers fixes the actual cause, while the person who only owns the prompt patches the symptom and hands the rest back.

I watched this play out with a team that hired a brilliant prompt specialist to fix a flaky summarization feature. He rewrote the prompt beautifully, the demos improved, and the failures continued, because the real problem was that the upstream chunks fed into the prompt were inconsistent. The fix lived in retrieval, a layer he did not own. The lesson, in this illustrative composite, was not that he was bad; it was that the seam between prompt and system was where the work fell through.

If you are weighing this against adjacent roles, my guide to what an LLM engineer is draws the boundaries, and the broader picture lives in my full guide to hiring AI engineers.

The skills and signals to screen for

Whether you hire the skill standalone or embedded, you are screening for the same thing: measurement discipline. The strongest signal is not how good someone's prompts are in a demo. It is whether they can prove a prompt got better and show you the evidence.

Concretely, a strong candidate builds an eval set before they touch the prompt. They sample real or realistic inputs, define what a good output is, and lock that set so it cannot drift to flatter a result. They version prompts, show you a diff with the metric that moved, and reach for structured outputs by default because a schema is testable and free-form prose is not. And they are fluent in the failure modes that do not show up in a happy-path demo: the long input that overflows context, the adversarial input that jailbreaks the instruction, the model update that shifts behavior overnight.

The weak signals are the opposite of these, and they are easy to miss because they present well. A candidate who talks about prompts in aesthetic terms, who shows you a gallery of impressive completions with no measurement behind them, who cannot describe how they would know the prompt regressed, is selling eloquence, not engineering. Eloquence makes a system feel better in a meeting and get worse in production, which is the most expensive failure mode there is because it hides.

This is the same hire-for-judgment principle that runs through everything I write about AI engineer skills: the frameworks and the model names are learnable, the judgment to know whether an output is correct and why is the scarce thing.

The screening rubric

Here is the rubric I actually use, mapped to a test you can run in an interview or a paid work sample, with what strong and weak answers look like. None of these require you to take the candidate's word for anything.

Signal	How to test it	Strong	Weak
Measurement first	"Improve this prompt." Watch what they do first.	Builds an eval set before editing the prompt	Starts rewriting the prompt immediately
Frozen evaluation	Ask how they know a change helped	Locked, version-named set; reports the diff	"It looked better in testing"
Structured outputs	Give a task with a downstream consumer	Designs a schema the next step can rely on	Returns free-form prose, hopes it parses
Failure-mode fluency	"How does this prompt break?"	Names overflow, jailbreak, model drift	Only describes the happy path
Model migration	"We are changing models next month."	Proposes a regression suite and a diff plan	"We will re-tune the prompt"
Systems sight	Hand them a "prompt bug" that is really retrieval	Traces it past the prompt to the real cause	Patches the prompt and declares victory

The single most diagnostic row is the first one. Hand a candidate a mediocre prompt and ask them to improve it: the weak ones start typing a better prompt within seconds. The strong ones ask what good looks like and how you will measure it, and then build the thing that answers that question before they change a word. That instinct, measure before you tune, is the whole job in miniature.

Where to find and vet a prompt engineer

The sourcing problem is harder than for most roles because the title is unstable. Searching job boards for "prompt engineer" returns a shrinking, noisy pool, with career-switchers who took a weekend course sitting next to genuine engineers. The people you actually want often do not carry the title at all, they are AI engineers, applied scientists, or product engineers who happen to own the prompt-and-evals layer of a shipping system.

That reframes the search. Instead of filtering on a job title, filter on evidence of production prompt work: an eval harness on their GitHub, a writeup of a model migration they survived, a structured-output schema they designed for a real consumer. The signal is shipped, measured work, not a portfolio of clever one-shots.

For vetting, nothing beats a paid work sample on a realistic task. Give them a genuinely flaky prompt from a system like yours, with sample inputs and a vague quality bar, and watch them turn the vague bar into a measurable one; you will learn more in two hours of that than in a day of behavioral interviews. I have screened a lot of candidates who interviewed brilliantly and could not, when handed a real task, do the boring part: define the metric, freeze the set, prove the change. The work sample finds that fast.

One pattern worth naming: a candidate aced a prompt-design whiteboard, then in the work sample shipped a prompt with no eval set and a confident claim that it was "clearly better." It was not measurably anything. The interview rewarded fluency; the work sample exposed the gap. We did not hire. (Illustrative, NDA-safe.)

If you would rather not run this gauntlet yourself, hiring through a team that has already vetted for this discipline is a shortcut, which is one reason we built a way to hire prompt engineers at Devlyn who treat prompts as versioned, tested, governed instructions rather than clever wording.

What it costs

Compensation tracks AI engineering broadly, which is itself a tell that the market does not treat prompt work as a lesser craft. Public salary data puts the median total pay for a prompt engineer around $126,000 a year in the US as of late 2025 (Coursera, citing Glassdoor), with senior and specialist roles climbing well above that. Contract and fractional rates vary widely with seniority and scope.

But the salary is the small number. The expensive number is the cost of getting the hire wrong, and it does not show up on the comp line. A prompt engineer with no measurement discipline ships changes that improve the demo and degrade the system, and because nothing is frozen or measured, the degradation is invisible until a customer finds it. The cost is the failed feature, the human who cleans up the wrong outputs, and the trust you spend with the customer who got the bad answer.

This is the same arithmetic I apply to every AI hire: the comp is a rounding error next to the cost of a confidently wrong system in front of real users. Pay for the judgment that prevents that, not for the title. The framing carries over directly from how I think about hiring an LLM engineer, where the screening bar, not the salary band, is what protects the P&L.

The salary is the small number. The expensive number is a confidently wrong system in front of real users, and it never shows up on the comp line.

The role's evolution, and the mistakes I keep seeing

It helps to see where the role came from. In 2023, prompt engineering was briefly the most-hyped job in tech, and the hype was not baseless, the models were genuinely hard to steer and a good prompt was a real edge. Then two things happened: the models got dramatically easier to instruct, and the prompt patterns that worked got written down everywhere. The scarce skill stopped being the wording and became the system around the wording: the evals, the versioning, the structured outputs, the migration tests.

The data tracks that shift. Microsoft and LinkedIn's 2024 Work Trend Index found that 66% of leaders would not hire someone without AI skills and 71% would take a less experienced candidate who had them over a more experienced one who did not (Microsoft, 2024). AI fluency went from a specialty to a baseline expectation, which is exactly the kind of pressure that dissolves a standalone title into a skill everyone is assumed to have. Commentators have made the stronger claim that the dedicated prompt engineering job is already obsolete (Salesforce Ben); I think that overshoots, the skill is very much alive, but the standalone seat is narrowing.

The mistakes I see follow from missing that shift. The first is hiring the title instead of the skill: posting for a "prompt engineer," screening on prompt aesthetics, and ending up with someone who cannot build the eval set that makes the prompts trustworthy. The second is vibe screening, deciding on a hire because the demo prompts were impressive, with no test of whether they can prove improvement. The third is splitting the prompt off from the system, creating that handoff seam where the real causes of failure, retrieval, context assembly, and missing evals, fall through.

The throughline is the same one I return to constantly: hire for judgment under production pressure, not for a portfolio of artifacts. The artifacts are easy to fake and easy to copy. The judgment to know whether an output is right, why it is wrong when it is wrong, and how to prove the fix held, is the scarce, durable thing, and it is the only thing worth paying a premium for.

Frequently asked questions

Do I need to hire a prompt engineer, or can an AI engineer do it?

For most teams, an AI engineer who owns the prompt-and-evals layer is the better hire, because the hardest prompt problems are usually retrieval or context problems that a prompt-only specialist does not own. A dedicated prompt engineer makes sense when you have a large prompt surface area, a heavy model-migration burden, or an AI org big enough that specialization pays off.

What does a prompt engineer actually do in 2026?

They treat prompts as code: versioned in a repo, tested against a frozen eval set, shipped behind a flag. The deliverables are structured-output schemas, tool-call prompts, regression suites that catch silent model drift, and migration tests for when you change models. The clever-wording era is over; the discipline is measurement.

How do I screen a prompt engineer in an interview?

Hand them a mediocre prompt and ask them to improve it. A strong candidate builds an eval set and defines what good looks like before editing the prompt; a weak one starts rewriting immediately. Then run a paid work sample on a realistic, flaky task and watch whether they turn a vague quality bar into a measurable one.

How much does it cost to hire a prompt engineer?

Median total pay sits around $126,000 a year in the US as of late 2025, with senior and specialist roles higher, and contract rates varying with scope. The bigger cost is a bad hire: a prompt engineer without measurement discipline ships changes that improve the demo and degrade the system invisibly, and that failure is far more expensive than the salary.

If you want the full picture of how this role fits a modern AI org, my book Building an AI-Native Team walks through the roles, cadences, and evidence loops end to end, and my guide to hiring AI engineers covers the wider hire. And if you would rather hire a prompt engineer who already treats prompts as versioned, tested, governed instructions, that is exactly what Devlyn's prompt engineering team is built to do. Hire the skill, screen for the evidence, and do not pay a premium for the title.

How to Hire an AI Solutions Architect (Without Regret)

Alpesh Nakrani — Thu, 09 Apr 2026 18:30:00 GMT

Hire an AI solutions architect to own system design, integration, build-vs-buy, governance, and cost. Here is what the role really owns, how to screen for it, and when you actually need one.

If you want to hire an AI solutions architect, the role you are actually buying is judgment about the whole system, not skill at any single part of it. A good one owns the shape of your AI: how it is designed, how it integrates with the systems you already run, what you build versus what you buy, how it is governed, and what it costs to operate at the volume you expect. Hire for that, and you get a person who keeps your AI program from quietly turning into a pile of demos nobody can ship. Hire for "knows the most about transformers," and you get an expensive prototyper who cannot tell you why the project stalled at the pilot.

I have sat in both seats. I started as an engineer, and I now run a company where I have to explain to customers why the AI we ship is correct, affordable, and safe. That second seat changes what you screen for. The architect who impresses in a whiteboard session and the architect who survives a quarter in production are frequently not the same person, and the gap between them is the most expensive hiring mistake in this category.

So let me be direct about the central decision, because most of the confusion in this market comes from skipping it. You are not hiring someone to write models, you are hiring someone to decide what gets built, what gets bought, what gets connected to what, and what you will be able to defend to a customer, a regulator, or your own board. The best AI solutions architects are operators who can also read a stack trace. If you would rather skip the hunt entirely, my team places enterprise AI architects who design AI that fits the enterprise, and the rest of this piece is the rubric we use ourselves.

Key takeaways

If you read nothing else, these are the load-bearing claims:

An AI solutions architect owns the system, not the model. The job is design, integration, build-vs-buy, governance, and cost, five surfaces, only one of which is "the AI part."
The cheapest signal of a real architect is build-vs-buy honesty. Someone who recommends building everything is selling you their own importance, not solving your problem.
Architect is not a senior AI engineer with a new title. The engineer makes a component work; the architect decides which components should exist and how they connect.
Cost literacy separates the role from a science project. An architect who cannot estimate cost per resolved task at your volume is designing a demo, not a product.
Most companies need the judgment before they need the headcount. A fractional or embedded architect for a quarter often beats a full-time hire you are not yet ready to direct.

What an AI solutions architect actually owns

The cleanest way I have found to define this role is by the five surfaces it is accountable for. Every other definition I have read either inflates the title into "person who knows AI" or shrinks it into "senior engineer." Neither is useful when you are writing a job description or screening a candidate. Anchor on the surfaces and the rest gets clearer.

System design. The architect decides the overall shape: what the pipeline looks like, where retrieval lives, where models sit, how requests route, where humans stay in the loop, and how the thing degrades when a dependency fails. This is the surface people expect, and it is necessary but not sufficient. A clean architecture diagram that ignores the other four surfaces is a poster, not a plan.

Integration. Your AI does not run in a vacuum. It has to read from and write to the systems you already operate, your CRM, your data warehouse, your identity provider, your ticketing system. An architect who has only built greenfield demos tends to underestimate this surface by an order of magnitude, because integration is where the boring, unglamorous, project-killing work actually lives. The model is a weekend; the integration is the quarter.

Build-vs-buy. This is the surface that most directly protects your budget, and the one most candidates are conflicted about. A real architect will tell you when an off-the-shelf API beats anything you would build, even though that answer makes their own role smaller, and they will tell you the narrow places where building genuinely earns its cost. The honest version of this judgment is the single most valuable thing the role produces.

Governance. Who can the system talk to, what data can it touch, where does that data live, what happens when it is wrong, and who is accountable. In healthcare, finance, and any enterprise with real customer data, governance is not a compliance afterthought, it is a design input that shapes the architecture from the first line. An architect who treats governance as something to bolt on at the end has never lost a deal to a procurement team, which means they are about to lose you one.

Cost. Every design choice has a unit economics consequence at volume. The architect should be able to tell you, roughly, what a design costs per resolved task, not per API call, per resolved task, and how that number moves as you scale. I have written about why LLM inference cost is the surface that quietly decides margin, and an architect who cannot reason about it is designing something you cannot afford to run.

A clean architecture diagram that ignores integration, governance, and cost is a poster, not a plan.

AI solutions architect vs AI engineer (and where teams confuse them)

This is the distinction that wastes the most money, so it is worth being precise. An AI engineer makes a component work: they take a defined problem, extract these fields, answer from this corpus, classify this intent, and they build the model, the prompt, the retrieval, the eval, until it hits the bar. They are essential, and the good ones are rare. But their accountability is the component.

An AI solutions architect decides which components should exist and how they connect into something the business can ship, afford, govern, and explain. They are accountable for the system. When the project stalls, it is rarely because a single component failed, it is because two systems would not integrate, or the cost at volume was untenable, or legal killed the data flow. Those are architect failures, not engineer failures, and you cannot fix them by hiring a better engineer.

The confusion happens because the titles overlap on a résumé and because a strong senior engineer often grows into architect judgment over time. But seniority in engineering is not the same as architecture judgment. The senior engineer who has shipped ten models has deep skill on one surface; the architect has shouldered the consequences of decisions across all five. If you want the full map of how these roles fit together, it sits inside my guide to hiring AI engineers, which is the pillar this article hangs off, and how to structure an AI team shows where the architect sits relative to everyone else.

The practical test: ask a candidate to describe a project where they recommended not building something. An engineer often cannot, because their world is "make the thing work." An architect almost always can, because saying no to a build is a core part of the job. The ones who only have build stories are engineers wearing the architect title, and that mismatch is exactly where pilots go to die.

The skills and signals that actually predict a good architect

Forget the certifications for a moment. A cloud-vendor AI architecture cert tells you someone passed a cert. It does not tell you they can design a system that survives contact with your real traffic, your real data constraints, and your real budget. The signals that actually predict success are harder to fake and rarely show up on a résumé.

Build-vs-buy honesty. The strongest single signal. An architect who reaches for the smallest, cheapest, most boring solution that clears the bar, and only builds custom where building genuinely wins, is thinking about your outcome. An architect who wants to build a platform for everything is thinking about their own scope.

Failure-mode fluency. Good architects talk about how systems break, not just how they work. They will tell you, unprompted, where their last design was fragile, what they would change, and what the on-call cost of a bad design choice actually was. Candidates who only have success stories have either never owned a system in production or are not telling you the truth.

Cost reasoning. They should be able to take a rough volume number and sketch the unit economics on a whiteboard, including where the cost concentrates and how to bring it down without wrecking quality. This is downstream of real production experience; you cannot fake it with theory.

Governance instinct. They ask about your data, your regulatory exposure, and your customers' trust before they ask about your model. That ordering is the tell. The architect who leads with governance has lost a deal to it before and learned the lesson the expensive way.

How to screen an AI solutions architect: signal, test, strong vs weak

Here is the rubric I actually use, in one table. Each row is a signal you care about, the test that surfaces it, and what a strong versus weak answer sounds like. The point of a test is to make the signal observable instead of taking it on faith from a résumé.

Signal	The test	Strong answer	Weak answer
Build-vs-buy judgment	"Tell me about a time you recommended not building something."	Concrete story; chose a boring API over a custom build and explained the math	Cannot recall one; defaults to building everything
Integration realism	"Walk me through connecting this to our existing CRM and warehouse."	Talks auth, rate limits, schema drift, failure handling, backfill	Hand-waves "we'll just call the API"
Cost literacy	"At 500k requests a month, what does this design cost to run?"	Estimates cost per resolved task; names where cost concentrates	Quotes a model's per-token price and stops there
Governance	"Where does the customer data go, and who is accountable when it's wrong?"	Leads with data residency, access, audit trail, human-in-the-loop	Treats it as a later compliance checkbox
Failure-mode fluency	"What was the worst design decision you've made, and what did it cost?"	Specific failure, the on-call cost, what changed after	Only success stories; no scar tissue
Eval discipline	"How would you prove this system is good enough to ship?"	Frozen eval set, failure modes by severity, a defensible gate	"We'll test it and see how it feels"

The eval row matters more than its single line suggests. An architect who cannot tell you how they would prove a system is good enough has no defensible basis for any ship decision, which means every launch becomes a vibe. If you want the depth behind that row, my guide to LLM evaluation covers the metrics and the discipline an architect should be fluent in.

What an AI solutions architect costs

Let me give you the honest version, because the salary aggregators only tell you part of it. The market for senior AI talent is hot and getting hotter: PwC's 2025 AI Jobs Barometer found that jobs highly exposed to AI are growing about 3.5 times faster than other occupations, with a roughly 56% wage premium for workers with specialized AI skills (PwC 2025 AI Jobs Barometer, via Hyperight). Robert Half's salary research puts AI and machine learning engineer compensation around the $170,750 mark and rising, with 87% of technology leaders reporting they pay a premium for specialized skills (Robert Half technology salary trends). An architect-level role, more senior, more cross-functional, sits at or above the top of those engineering bands.

So as an illustrative US range, expect a full-time enterprise AI architect to land somewhere in the rough region of $200k to $320k in base, before equity, bonus, and the loaded cost of benefits, recruiting, ramp, and management overhead. Treat those numbers as directional, not a quote, the band moves fast and varies by market and depth. The point is that the sticker price is the smaller half of the real cost.

The loaded cost is what actually matters, and it is the same lens I apply to every AI role in what an AI engineer really costs. A full-time architect you are not ready to direct is more expensive than an empty seat, because they will design ambitiously, you will fund it, and you will discover the mismatch a year and a budget later. This is exactly why the in-house versus outsourced decision matters more for this role than almost any other: a fractional or embedded architect for one quarter costs a fraction of a full-time hire and tells you what you actually need before you commit to a headcount you cannot yet steer.

A full-time architect you are not ready to direct is more expensive than an empty seat.

When an enterprise actually needs one (and when it doesn't)

Not every company that wants AI needs an AI solutions architect on staff, and pretending otherwise sells headcount nobody can use. You need one when your AI ambitions cross a system boundary, when the AI has to touch more than one of your existing systems, serve real customers at volume, or operate under governance constraints you cannot afford to get wrong. At that point the cost of a bad architecture exceeds the cost of the architect, and the hire pays for itself by preventing one stalled pilot.

You do not need one yet when you are still proving a single use case with a single integration and a forgiving audience. That is engineer work, or even capable-generalist work. Hiring a senior architect to babysit one prototype is like hiring a structural engineer to hang a picture, the skill is real and entirely wasted on the task. The readiness signal is integration count and consequence, not enthusiasm for AI.

The honest middle case is the most common one: you have outgrown the prototype, you have two or three integrations on the roadmap, governance is starting to bite, and you are not yet sure the workload justifies a permanent senior hire. That is precisely the case for embedding an architect for a defined engagement rather than opening a full-time req. You get the judgment, you get a reference architecture you own, and you defer the headcount decision until you have evidence. If that is where you are, this is the engagement my team runs most often, scoped, embedded, and designed to leave you with a system you can operate, not a dependency.

The mistakes that waste the hire

The first mistake is hiring for the wrong surface. Teams interview hard on model knowledge and barely test integration, governance, or cost, then are surprised when the brilliant modeler cannot get anything into production. You get what you screen for. If five of your six interview questions are about the model, you are hiring an engineer and calling it an architect.

The second mistake is hiring an architect with no engineering scar tissue. There is a species of "AI strategist" who can draw beautiful diagrams and has never been paged at 3 a.m. because their design fell over. Diagrams are cheap. The value is in the judgment that comes from having owned the consequences, and you can only screen for that by asking about failures and listening for specifics.

The third mistake is hiring full-time before you can direct the role. An architect with no clear mandate will manufacture scope, usually an ambitious internal platform, because that is what underemployed senior talent does. The fix is not a better architect; it is a clearer problem. Define the system boundary you need crossed before you open the req, or embed someone to help you define it first.

The fourth mistake is treating the cert as the qualification. A cloud certification proves familiarity with one vendor's tools. It says nothing about whether the person can make a build-vs-buy call that protects your margin or design a governance model that survives a procurement review. Screen the judgment, not the badge.

Frequently asked questions

What does an AI solutions architect do? An AI solutions architect owns the design of an organization's AI systems across five surfaces: system design, integration with existing systems, build-vs-buy decisions, governance, and cost at scale. They decide which components should exist and how they connect into something the business can ship, afford, govern, and explain, as distinct from an engineer, who builds the individual components.

What is the difference between an AI solutions architect and an AI engineer? The engineer makes a component work; the architect decides which components should exist and how they connect into a system. The engineer is accountable for the model, retrieval, or pipeline they build, while the architect is accountable for whether the whole thing ships, integrates, stays affordable, and can be governed. Hiring a senior engineer expecting architect judgment is the most common and expensive mistake in this category.

How much does it cost to hire an AI solutions architect? As an illustrative US range, a full-time enterprise AI architect typically lands somewhere around $200k to $320k in base before equity and the loaded cost of benefits, recruiting, ramp, and management. The band moves fast and varies by market, so treat it as directional. For many companies an embedded or fractional architect for a quarter is the better first move, because it delivers the judgment without committing to a headcount you may not yet be able to direct.

When does an enterprise need an AI solutions architect? When your AI ambitions cross a system boundary: the AI has to integrate with more than one existing system, serve real customers at volume, or operate under governance constraints you cannot afford to get wrong. Below that threshold, a single use case, a single integration, a forgiving audience, the work is engineer or capable-generalist work, and a senior architect is over-spec for the task.

If you are weighing whether to build this judgment in-house or bring it in, the longer argument lives in my book Building an AI-Native Team, which covers how to structure roles like this so humans stay the bottleneck only where their judgment actually matters. And if you would rather skip the hiring market and embed a senior architect who has owned all five surfaces in production, that is exactly what my team at Devlyn does, alongside the enterprise AI integration work that tends to follow. Hire for judgment about the whole system. Everything else is a component.

How to Hire an AI Product Manager (What to Look For)

Alpesh Nakrani — Wed, 08 Apr 2026 18:30:00 GMT

How and where to hire an AI product manager, the signals to screen for, what an AI PM actually owns, and what it costs in 2026.

To hire an AI product manager who will actually move a product, screen for someone who can own an eval-driven roadmap, reason about model uncertainty, and design UX for a system that is sometimes wrong, then source them through specialist networks or a partner that pre-vets for production AI experience rather than a general job board. The fastest path when you cannot vet the role yourself is to hire through a partner who can put a pre-vetted AI PM in front of you in days, instead of the months the open market currently takes for this title.

I have sat on both sides of this hire. I started as an engineer, and I now run revenue at Devlyn, where I scope, hire, and deploy people who own AI product decisions that touch paying customers. So I will skip the recruiter platitudes and tell you what separates an AI PM who turns a model demo into a shipped, trusted feature from one who spends two quarters writing requirements for behavior the model cannot actually deliver. This is the role-specific deep dive under my broader guide to hiring AI engineers.

Key takeaway: An AI product manager owns the behavior of a probabilistic system, not a feature list. Screen for eval literacy and comfort with uncertainty, not prompt-writing trivia or a polished AI deck.
The interview should contain a judgment test. Hand them a model that is fluent and confidently wrong 8% of the time and ask what they ship. The answer separates an AI PM from a PM who has read about AI.
The roadmap gates on evals, not calendars. A real AI PM ties release decisions to a frozen eval set and a tolerance, not to a date and a vibe-check demo.
Cost tracks scarcity and ambiguity. The title is still being defined in real time, so comp bands are wide and the wrong hire is the most expensive line item, not the salary.
Build-vs-partner hinges on one question: can you vet this person yourself? If your team cannot tell a strong AI PM answer from a confident one, hire through a partner who can.

What an AI product manager actually owns

An AI product manager owns the behavior of a system that is right most of the time and wrong some of the time, in front of real users, and is accountable for what happens in both cases. That sentence is the whole job. A traditional PM ships a feature that either works or has a bug. An AI PM ships a feature whose correctness is a distribution, and the product decisions all live in how you handle that distribution.

The first thing a good AI PM owns is an eval-driven roadmap. Instead of "ship the summarization feature by Q3," the unit of work is "get faithfulness above 0.90 and human-disagreement under 8% on a frozen, production-sampled set, then ship." The roadmap gates on evidence, not on a date. If a candidate cannot describe a release decision in those terms, they will manage your AI product the way they managed a CRUD app, and the model will embarrass you in production.

The second thing they own is model uncertainty as a first-class product input. Every AI feature has a failure rate, and the PM's job is to decide what failure rate is acceptable for this use case and what the product does when it fails. A 5% error rate is fine for a draft-suggestion feature and unacceptable for anything that touches money or medical advice. Sizing that tolerance, and designing the fallback, is product work, not engineering work.

The third thing they own is data and ground truth. In a traditional product, data is something the analytics team reports on after the fact. In an AI product, the labeled examples, the eval set, and the feedback loop are the raw material the whole feature is built on. A serious AI PM treats building and maintaining ground truth as a roadmap item with the same weight as any feature, because without it nobody can say whether the model is getting better or worse.

A traditional PM ships a feature that either works or has a bug. An AI PM ships a feature whose correctness is a distribution, and every product decision lives in how you handle that distribution.

The fourth thing, and the one most overlooked, is the UX of probabilistic systems. When the model can be wrong, the interface has to be designed around that fact: confidence cues, easy correction paths, undo, a human-in-the-loop escape hatch for high-stakes cases. The best AI PMs I have worked with think about the wrong answer as carefully as the right one, because the wrong answer is where trust is won or lost. If you want the evaluation side of this in depth, my guide to LLM evaluation covers how those tolerances get measured.

AI product manager vs traditional product manager

The honest version of this comparison is that an AI PM is a traditional PM plus a specific second literacy, not a different species. The core craft carries over: customer discovery, prioritization, writing clearly, saying no, shipping. What changes is the substrate underneath the product, and the substrate changes enough decisions that a strong generalist PM with no AI exposure will make predictable mistakes in the first quarter.

The biggest difference is certainty. A traditional roadmap assumes that if you build the spec, the feature behaves as specified. An AI roadmap assumes the feature behaves as a distribution you can shift but not fully control, so "done" is defined by a metric threshold, not by a checkbox. An AI PM who does not internalize this writes specs the model can never satisfy and then blames the engineers.

The second difference is the relationship to data and evals. A traditional PM can ship competently without ever touching the eval harness, but an AI PM cannot, because the eval set is how they know whether a change helped. The PMs I trust can read an eval report, argue about whether the rubric is too loose, and tell you which failure mode actually matters for the business. That fluency overlaps with what I look for when designing the human-in-the-loop review that keeps a model honest.

The third difference is the team they sit in. An AI product lives inside a tighter loop with engineering and evaluation than a typical product does, which is why I put the evaluation function near the center when I write about AI team structure. An AI PM who wants to operate at arm's length from the model behavior, the way some PMs operate at arm's length from the codebase, is in the wrong role.

The skills and signals to screen for

Forget the certificate and the list of tools they have touched. The signals that predict a strong AI PM are mostly about how they reason under uncertainty, and you can surface all of them in a single well-designed loop. Here is what I screen for, in priority order.

Eval literacy. Can they define "good enough" as a measurable thing rather than a feeling? Strong candidates reach for a frozen set, a tolerance, and a failure-mode breakdown without prompting. Weak candidates talk about accuracy as a single number and cannot tell you what they would do when it dips.

Probabilistic thinking. Do they treat the model's error rate as a design input or as a bug to be eliminated? The right answer is that some error rate is permanent and the product has to be built around it. Anyone who promises to "get the hallucinations to zero" has not shipped an AI product.

Data and ground-truth fluency. Do they understand that the eval set and the feedback loop are product assets they have to build and defend? A candidate who has never thought about where labeled examples come from will under-resource the one thing that makes the feature improvable.

Shipping judgment under ambiguity is the hardest and most valuable signal. Can they decide to ship a 92%-correct feature with the right guardrails, or do they freeze because it is not perfect? Operators ship with guardrails; theorists wait for certainty that never arrives. The role lives at this exact decision, and it is adjacent to the judgment I screen for across every AI engineering role on the team.

A screening table: signal, test, strong vs weak

Here is the same set of signals as a loop you can run. For each, the test to use in the interview, and how to read the answer. I keep this in front of me during the conversation so I am scoring against the failure mode, not the polish.

Signal	How to test it	Strong answer	Weak answer
Eval literacy	"This model is 92% accurate. Do you ship?"	Asks what the 8% failures are, on what set, and at what stakes before deciding	Says yes or no based on the single number
Probabilistic thinking	"How do you get the error rate to zero?"	Says you do not; you design the product around a residual error rate	Promises better prompts or a bigger model will fix it
Data and ground truth	"How do you know the model improved this week?"	Describes a frozen eval set, sampled from real traffic, scored the same way each time	Points to user sentiment or a one-off demo
UX of being wrong	"The model gives a confident wrong answer. What does the user see?"	Describes confidence cues, easy correction, and a human escape hatch for high stakes	Treats the wrong answer as an edge case to ignore
Shipping judgment	"Perfect is months away. What ships Friday?"	Ships a scoped version with guardrails and a measured tolerance	Waits for the model to be ready, with no date

None of these questions has a trick answer, and none of them rewards memorized vocabulary. They reward someone who has actually owned a probabilistic feature and felt the consequences of getting the tolerance wrong. That is the person you want.

Where to find and vet AI product managers

The supply is thin because the role is new, so the open market is slow and noisy. The pool you are fishing in is mostly traditional PMs adding AI to their resume after one feature, plus a smaller number of people who have genuinely owned a model in production. Telling those two apart is the entire vetting problem, and a general job board will not do it for you.

The channels that work are specialist communities where AI PMs actually congregate, referrals from engineers who have shipped AI features and can vouch for who carried the judgment, and partners that pre-vet for production AI experience. The channel that consistently disappoints is the broad job post, which floods you with the resume-deep candidates and forces your team to run the vetting loop dozens of times.

This is the real fork. If your team already has the eval literacy to run the screening table above and tell a strong answer from a confident one, hire direct and take the time. If you do not yet have that literacy in-house, which is common precisely because you are hiring this role to get it, then running the vet yourself is how good people get rejected and confident people get hired, and a pre-vetting partner is faster and far cheaper than a wrong full-time hire. That is the exact problem the Devlyn team solves when you hire AI product talent through us: we put people who have owned production AI in front of you, already filtered for the judgment this section is about.

I will give an NDA-safe version of how this goes wrong. A founder I advised ran a clean, traditional PM loop, case study, product sense, stakeholder role-play, and hired a sharp PM with a great track record at a SaaS company. Six weeks in, the AI feature was stalled because the PM kept asking engineering to "make it accurate" and could not define what accurate meant or accept that some error was permanent. The skills were real; the second literacy was missing, and the loop never tested for it.

What an AI product manager costs

Comp for this role is unusually messy, because the title is being defined in real time and two people with similar resumes can land in very different bands depending on whether the company codes them as a senior PM with AI skills or as a dedicated AI product manager. So treat any number here as an illustrative range, not a quote. In the US market through 2026, dedicated AI PM total compensation broadly spans a wide band, with senior roles at well-funded companies running materially higher once equity is included, and frontier labs sitting in their own tier.

The demand context is real even where the exact salary data is not standardized. Lightcast and Stanford's AI Index found that roughly 1.8% of US job postings mentioned AI-related skills in 2024, up about 20% year over year (Lightcast / Stanford AI Index 2025), and Lenny Rachitsky's read on the early-2026 product job market is that demand for AI engineers and AI PMs is climbing fast while supply lags (Lenny's Newsletter). Thin supply against rising demand is exactly the condition that stretches time-to-hire into months.

Here is the framing that matters more than the salary line. The cost of an AI PM is not the comp band; it is the cost of the wrong one. A traditional-PM mishire on an AI product does not just underperform, they quietly point the whole team at the wrong definition of done, and you lose a quarter or two before the eval numbers make the problem undeniable. Against that, the difference between bands, or a partner fee, is rounding error.

The cost of an AI product manager is not the salary band. It is the cost of the wrong one, who points the whole team at the wrong definition of done for two quarters.

This is the same calculus I apply to every senior AI hire, and it is why I weight time-to-confidence over time-to-fill. A pre-vetted hire who is productive in week one against a hire who looks good on paper and stalls the roadmap in month two is not a close comparison once you price in the lost quarter.

The mistakes that burn six months

The mistakes in this hire are predictable, which is the good news, because predictable mistakes are screenable. Almost every failed AI PM hire I have seen falls into one of three archetypes, and each one is avoidable if you know the shape in advance.

The first is hiring a model researcher and calling them a PM. Someone with deep ML credentials is not automatically a product manager, and many of them have no interest in the prioritization, the stakeholder work, or the customer discovery that the job actually requires. Research depth is a fine bonus and a poor substitute for product judgment.

The second is hiring a feature-list PM with an AI veneer. This is the most common and most expensive mistake, the one in my earlier story. They are excellent at the traditional craft, they have shipped one AI feature, and they manage the AI product as if it were deterministic. The roadmap is dates, the spec assumes the model obeys, and the team grinds against a definition of done that the model cannot meet.

The third is hiring for the hype keyword instead of the failure mode you cannot tolerate. If you write the JD around "GenAI" and "LLMs" and "agents," you will attract people fluent in the vocabulary and silent on the judgment. Define the role by what must not break in your product, the failure rate you cannot accept and the trust you cannot lose, then hire against that. I make the broader version of this argument across the whole AI engineering skill set, and it holds doubly for the PM who sets the bar everyone else builds to.

Frequently asked questions

What does an AI product manager do? An AI product manager owns the behavior of a probabilistic system in production. They set the eval-driven roadmap, decide the acceptable error rate for each use case, treat the ground-truth data set as a product asset, and design the UX for when the model is wrong. The core PM craft is the same; the second literacy around uncertainty and evaluation is what makes it an AI PM role.

How do I hire an AI product manager? Screen for eval literacy, probabilistic thinking, data fluency, and shipping judgment under ambiguity, using a loop that hands the candidate a fluent-but-wrong model and watches how they reason. Source through specialist communities, engineer referrals, or a pre-vetting partner. Avoid the broad job board, which floods you with resume-deep candidates your team then has to vet one by one.

What is the difference between an AI product manager and a traditional product manager? The craft overlaps; the substrate differs. A traditional PM ships features that work or have bugs and defines done as a checkbox, while an AI PM ships features whose correctness is a distribution and defines done as a metric threshold on a frozen eval set. The AI PM also owns model uncertainty and the data loop, which a traditional PM can usually ignore.

How much does an AI product manager cost? Comp bands are wide and still unstandardized because the title is being defined in real time, so any figure is illustrative rather than a quote. The number that matters more is the cost of a wrong hire, who can point a team at the wrong definition of done for a quarter or two; against that, the difference between salary bands or a partner fee is small.

If you want the full operating model for the team this PM sits inside, including the roles, cadences, and evidence loops, my book Building an AI-Native Team walks through it end to end. And if you would rather skip the months of vetting and have a pre-vetted AI product manager who has actually owned production AI in front of you in days, that is exactly what hiring AI product talent through Devlyn is for. Hire for the judgment. Screen for the failure mode you cannot tolerate.

How to Hire a Python Developer for AI (What to Look For)

Alpesh Nakrani — Tue, 07 Apr 2026 18:30:00 GMT

How to hire a Python developer for AI: the skills and signals to screen for, the generalist-versus-specialist trap, what it costs, and when to hire through a partner.

To hire a Python developer for AI who actually ships, screen for someone who can move between the data layer and the application layer with equal comfort, who treats evaluation and failure modes as the job rather than the afterthought, and who has put a model or an LLM feature in front of real users and watched it break. If you cannot vet that yourself, the fastest path is to hire through a partner who can put a pre-vetted senior AI Python developer in front of you in days, instead of running the multi-month open-market search a strong one now requires.

I have sat on both sides of this table. I started as an engineer, spent a decade as a CTO and COO, and I now run revenue at Devlyn, where I hire and deploy Python developers into AI products that touch paying customers. So I will skip the recruiter platitudes and tell you what separates a Python developer who turns a notebook into an AI feature that earns its keep from one who burns a quarter on something that demoed beautifully and never survived contact with live traffic. This is the Python-specialist deep dive under my broader guide to hiring AI engineers.

Key takeaway: A Python developer for AI is a both-layers hire. Screen for judgment across data, modeling, and serving, not for the longest framework list on the resume.
A generalist Python web developer is not an AI Python developer. Building a Django CRUD app and shipping a reliable LLM feature share a language and almost nothing else.
The interview must contain real, messy AI work. A LeetCode round tells you nothing about whether someone can debug a hallucinating pipeline or a model that rots in production.
Cost tracks scarcity, not hype. Python developers run roughly $112K base in the US on average, and AI-specialist talent commands more; the wrong hire costs far past any salary.
The build-versus-partner call hinges on one question: can you vet this person yourself? If not, a pre-vetting partner is faster and cheaper than a wrong full-time hire.

Why Python is the AI stack, and what an AI Python developer owns

Python is not winning the AI race because it is fast; it is winning because the entire ecosystem of AI tooling is written in it or exposes a Python interface first. In the 2024 Stack Overflow Developer Survey, Python sat among the most-used languages overall, and the data-science and AI libraries that matter, NumPy, pandas, PyTorch, and TensorFlow, all surfaced as the dominant frameworks in that space (Stack Overflow 2024 survey). When you hire a Python developer for AI, you are hiring into the language where the work actually happens.

But the language is the floor, not the job. The role spans more surface area than people expect. An AI-focused Python developer owns the data pipeline that feeds a model, the training or fine-tuning loop where one exists, the inference path that serves predictions or LLM calls under real latency, and the evaluation harness that tells you whether any of it is working. That is a wide brief, and most resumes cover one slice of it convincingly and bluff the rest.

The honest version of this role is that it sits between two worlds. On one side is the data-and-modeling work that an ML engineer owns. On the other is the applied-systems work of wiring a model into a product that holds up at 3am. A strong AI Python developer is comfortable in both, which is exactly why they are hard to find and easy to misjudge in an interview built for a generic backend role.

I have learned to distrust candidates who describe the job as "calling the model." Calling the model is the trivial part. The work is everything around the call: shaping the input, handling the failure when the output is wrong, measuring whether it was wrong at all, and keeping the cost and latency inside a budget the business can live with. That is where the value is, and it is the part a generalist almost never has scar tissue in.

The libraries and skills that actually matter

Start with the data layer, because everything downstream inherits its mistakes. A real AI Python developer is fluent in numpy and pandas, not as trivia but as instinct: they reach for vectorized operations over loops, they know where a join silently duplicates rows, and they treat a data pipeline as the product because in production it is. If the features are wrong, the smartest model in the world learns the wrong thing confidently.

On the modeling side, the framework matters less than the judgment around it. PyTorch is the de facto research and production deep-learning library, and a candidate who has trained or fine-tuned in it should be able to explain a training loop, a loss curve that is lying to them, and why a model that scored well offline degrades on next month's traffic. For most teams building on foundation models, though, the relevant fluency is the LLM SDKs, the OpenAI and Anthropic clients, structured outputs, tool calls, and retrieval, not training a network from scratch.

On the serving side, FastAPI is the skill that turns a model into a product. An AI feature is an async, latency-bound, failure-prone service, and a developer who understands async Python, request lifecycles, timeouts, and streaming responses will ship something that holds under load. One who only knows the notebook will hand you something that works once on their machine and falls over the first time two users hit it at once.

The skill that ties it together, and the one I weight most, is evaluation discipline. A Python developer for AI who cannot tell you how they would know their LLM feature is wrong in production has not yet shipped one that was. The strongest candidates build an evaluation loop against real, production-sampled data before they trust a single output, and they treat type hints, tests, and reproducibility as table stakes rather than nice-to-haves. For the broader map of what separates the good ones, see the skills that actually matter.

Calling the model is the trivial part. The work is everything around the call: shaping the input, catching the wrong output, and proving it was wrong at all.

A generalist Python developer is not an AI Python developer

This is the most expensive misunderstanding I see buyers make. They reason, correctly, that AI work happens in Python, and then conclude, incorrectly, that any strong Python developer can do it. A senior Django engineer who has shipped a decade of clean web applications is a genuinely valuable hire. They are also, in most cases, the wrong person to own your LLM feature, and putting them on it sets both of you up to fail.

The gap is not language; it is the shape of the problem. Traditional software is deterministic: the same input produces the same output, and you test it by asserting equality. AI systems are probabilistic: the same input can produce different outputs, "correct" is a distribution rather than a value, and you cannot assert your way to confidence. A developer whose entire instinct is built around deterministic testing will reach for the wrong tools and be quietly lost the first time the model is confidently wrong.

The other half of the gap is failure modes. A generalist debugs a stack trace; an AI Python developer debugs a hallucination, a silent data drift, a retrieval step that returns plausible but irrelevant context, a cost curve that triples when an edge case loops the model. None of those throw an exception. All of them cost money. The instinct to suspect the output even when nothing crashed is learned in production, not in a bootcamp.

I am not arguing a generalist can never cross over; many of the best AI Python developers started as backend engineers and learned the probabilistic mindset on a real project. I am arguing you cannot assume the crossover happened. Hire for evidence that it did, a shipped AI feature, an eval suite they built, a postmortem on a model that degraded, not for years of Python on unrelated work.

A signal-by-signal screening table you can run

Here is how I turn those distinctions into an interview. For each signal there is something concrete to test and a clear tell that separates a strong answer from a weak one. Paste this into your hiring doc and run it.

Signal	What to test	Strong vs weak
Probabilistic mindset	"This LLM feature is right 90% of the time. How do you ship it responsibly?"	Strong: builds an eval set, defines failure tolerance, adds guardrails and fallbacks. Weak: "add more prompt instructions" and stops.
Data fluency	Hand them a messy dataframe; ask them to find and fix a leak or a bad join	Strong: inspects distributions, catches duplicated rows, vectorizes. Weak: loops over rows, trusts the first number.
Serving and async	"How do you serve this model behind FastAPI at low latency under load?"	Strong: async, timeouts, streaming, batching, caching. Weak: a synchronous endpoint that blocks on every call.
Evaluation discipline	"How would you know this feature is getting worse in production?"	Strong: production-sampled eval set, regression tracking, alerting. Weak: "we'd hear from users."
Cost and latency awareness	Ask what their AI feature costs per call and how they would cut it	Strong: token budgets, smaller models, caching, routing. Weak: never measured it.
Production scar tissue	"Tell me about an AI feature that broke after it shipped"	Strong: a specific silent failure and the fix that stuck. Weak: only demo or tutorial stories.

The pattern across every row is the same. A strong AI Python developer treats the working demo as a hypothesis to be disproven and the feature as a system to be monitored; a weak one treats the demo as the finish line. You are hiring for the first kind.

Where to find AI Python developers, and how to vet them

The strongest AI Python developers are rarely scanning general job boards; they are employed, building, and reachable through specialist communities, open-source contributions to AI and data tooling, technical writing, and referrals from people who have shipped models alongside them. A candidate who has published an honest writeup of an LLM feature that degraded is worth ten who list "AI/ML" as a skill.

Wherever you source them, the vetting bar is the same, and it is not a LeetCode loop. The single highest-signal screen is a small, paid take-home built around realistic, messy AI work: here is a dataset with a subtle problem and an LLM task with no clean answer, build something you would actually deploy and tell me what you do not trust about it. How they reason through ambiguity beats any whiteboard round. For the full screening playbook, see what an AI engineer actually does.

I once watched a team nearly pass on a quiet candidate who fumbled an algorithm puzzle, then ace the take-home by refusing to report a success rate until she had built a small eval set and found the model was failing badly on one input category that the happy-path demo never hit. They hired her. She turned out to be the best AI engineer on the team, precisely because her instinct was to distrust the output before she trusted it. The puzzle round would have screened her out; the AI-shaped exercise screened her in. The details are changed, but the lesson is not.

The mirror-image story is the senior generalist who dazzled in the interview, named every library, and shipped an LLM support feature that looked brilliant in the demo and quietly cost a fortune in production because nobody had set a token budget or noticed one query pattern was looping the model on every request. Both are composites. Both point the same direction: vet for the discipline around probabilistic systems, not the vocabulary around the libraries.

What it costs to hire a Python developer for AI

Compensation for this role is high because the talent is genuinely scarce, not because of hype. As a baseline, the average Python developer in the US earns around $112K base and roughly $128K in total compensation, with a range that runs from about $85K to $160K, per the Built In salary data. That is the generalist Python figure; developers with real AI and machine-learning depth sit at the upper end and well past it, because the both-layers skill set this article describes is rarer than either web Python or pure data science alone.

The number that gets ignored is the cost of getting it wrong. A failed senior technical hire is commonly estimated at 1.5x to 3x annual salary once you count ramp time, severance, the opportunity cost of the unbuilt roadmap, and the rehire. For a $150K-plus AI Python role, that is a six-figure mistake, and it is far more likely when you cannot evaluate the person you are hiring. The expensive part of hiring is not the salary; it is the wrong salary attached to the wrong person on your most important AI bet.

In-house versus hiring through a partner

The build-versus-partner decision is not about cost first; it is about your ability to vet and the time you have. Hiring a full-time AI Python developer into your own org is the right move when the work is core and recurring, when you can credibly evaluate the candidate, and when you can afford to wait months to fill the seat. If all three are true, hire in-house and own the capability.

The case for a partner gets strong the moment one of those conditions fails. If you cannot confidently vet an AI Python developer yourself, you are making a six-figure bet on a skill set you cannot assess, and a partner who has already done the vetting absorbs that risk. If you need someone shipping in weeks rather than months, a pre-vetting partner skips the open-market search. And if the work is real but not yet permanent headcount, an embedded specialist lets you move now without committing to a hire you might not need in a year.

This is the gap Devlyn is built to close. If you would rather not run a multi-month search and a vetting loop you are not equipped to run, Devlyn can put a pre-vetted senior Python developer for AI in front of you in 48 hours, screened for exactly the signals in the table above: data fluency, serving and async, evaluation discipline, cost awareness, and production ownership. You keep the option to convert to full-time once you have seen the work, which is a far safer way to make a senior hire than a resume and three interviews.

The honest version of this advice is that a partner is not always the answer. If AI is your core product surface for the next five years and you have the judgment to hire well, building the team yourself is the better long-term play, and my book Building an AI-Native Team is about exactly that. The partner route wins on speed, vetting risk, and optionality, which is what most teams making their first AI hire are short on.

The mistakes that sink an AI Python hire

The mistake I see most often is hiring the Python resume instead of the AI failure mode. Years of clean web development is real signal for web development and weak signal for whether someone can ship a probabilistic system. Start from the question "what must this AI feature never get wrong, and how would we know?" and hire the person whose instincts are organized around answering it.

The second mistake is an interview loop with no AI in it. If your process is two algorithm rounds and a behavioral chat, you have measured general engineering and culture and learned nothing about whether this person can build an AI feature you can trust. The interview has to contain the actual job: a messy dataset, an open-ended LLM task, a metric to interrogate, scored on reasoning rather than a clean answer.

The third mistake is ignoring the operational half of the role. An AI feature is not a deliverable; it is a system that needs evaluation, monitoring, and cost control long after the launch demo. Hire someone who has lived through a model degrading silently or a cost curve spiking, because they will build the eval set and the budget alerts from day one instead of discovering they were needed after it already cost you money. I make the broader version of this case in how to hire an ML engineer.

The fourth mistake is treating the demo as the bar. A candidate who can wire up an impressive prototype but has never owned a real evaluation loop against production data will produce demos that thrill the room and features that quietly fail at scale. The demo is table stakes; the discipline to evaluate honestly, monitor in production, and catch the silent failures is the actual job.

Frequently asked questions

What is the difference between a Python developer and a Python developer for AI?

They share a language and little else. A general Python developer ships deterministic software you can test by asserting equality. A Python developer for AI works with probabilistic systems where the same input can produce different outputs, "correct" is a distribution, and the failure modes are hallucination, drift, and runaway cost rather than a stack trace. Hire for evidence of the AI-specific mindset, a shipped feature and an eval suite they built, not for years of unrelated Python.

What skills should an AI Python developer have?

Data fluency in numpy and pandas, modeling judgment in PyTorch or with the LLM SDKs depending on your stack, async serving in FastAPI, and above all evaluation discipline, the ability to build an eval loop against real data and tell you how they would know a feature is getting worse. Type hints, tests, and reproducibility are table stakes. The framework names matter less than the judgment around probabilistic systems.

How much does it cost to hire a Python developer for AI?

In the US, the average Python developer earns around $112K base and roughly $128K total compensation, with a range of about $85K to $160K, and developers with real AI and machine-learning depth sit at the upper end and beyond. Embedded or partner engagements trade a monthly rate for speed and lower vetting risk. The bigger number to watch is the cost of a wrong hire, commonly 1.5x to 3x salary once you count ramp, opportunity cost, and rehire.

Should I hire an AI Python developer in-house or through a partner?

Hire in-house when the work is core and recurring, you can vet the candidate yourself, and you can wait months to fill the seat. Hire through a pre-vetting partner when you cannot confidently assess the skill set, you need someone shipping in weeks, or the work is real but not yet permanent headcount. A partner absorbs the vetting risk on a six-figure bet, and you can convert a strong embedded developer to full-time once you have seen the work.

If you want the broader hiring playbook this fits inside, start with my guide to hiring AI engineers and the team-design thinking in Building an AI-Native Team. And if you would rather skip the multi-month search and the vetting loop you are not equipped to run, Devlyn can put a pre-vetted senior Python developer for AI in front of you in 48 hours, screened for the data, serving, and evaluation discipline that actually predicts a feature worth shipping. Hire for the judgment around probabilistic systems. Ignore the framework list.

How to Hire a React Developer for AI Products

Alpesh Nakrani — Mon, 06 Apr 2026 18:30:00 GMT

Hire a React developer who can build AI-product frontends: streaming chat, agent interfaces, and state that survives token-by-token output, not just generic React.

If you are building an AI product, you should hire a React developer who can build for streaming, uncertainty, and failure, not just a developer who can build a clean form and a dashboard. That is the whole decision in one sentence. The market is full of strong generic React engineers, and most of them have never shipped a UI that renders a model's answer one token at a time, cancels a half-finished agent run cleanly, or shows a chat thread that does not jump around while the assistant is still thinking. Those are the skills that break an AI frontend in production, and they are exactly the skills a generic React interview will never test.

I have sat in both seats. I came up as an engineer and now run conversion and product as a CRO, so I have written the streaming code and signed off on the hire who could not. The gap between a React developer who looks great on a to-do-list demo and one who can hold a streaming chat interface together under real latency is the most expensive gap I see hiring managers miss. This piece is how I screen for it, what it costs, and where it goes wrong.

This is a supporting read under my broader guide to hiring AI engineers. If you have already decided you need a senior React developer for an AI-native frontend and want a team that has shipped this work, that is what the Devlyn React team does. The rest of this is how to make a good call whether you hire through us or anyone else.

Key takeaways

Hire for failure modes, not framerate. The skill that matters in an AI frontend is handling slow, partial, cancelable, sometimes-wrong model output, not building another pixel-perfect static page.
Streaming is the load-bearing skill. A React developer for AI products must be fluent in token streaming over SSE, cancelation, and rendering that does not thrash while text arrives.
Optimistic UI that lies is worse than a spinner. Chat and agent interfaces need state that reconciles honestly when the model corrects itself or a tool call fails.
A generic React screen will pass the wrong person. Test the streaming and cancel paths explicitly, or you are hiring on a demo that has nothing to do with your product.
Rate follows the skill, not the title. "React developer" spans a wide price band; pay for the AI-frontend experience only where your product actually needs it.

Generic React versus an AI-product frontend developer

Almost every React developer you interview will be competent at the thing React was built for: take a known state, render it, update it on events, keep it fast. For a CRUD app, an admin panel, or a marketing site, that is the whole job, the talent pool is deep and reasonably priced, and you should hire on the normal signals without overpaying for AI experience you will never use.

An AI product frontend is a different animal. The data does not arrive all at once; it dribbles in token by token over seconds, and the answer is not known to be correct, so the model may contradict itself mid-stream or call a tool that fails. The user can cancel, and your UI has to stop an in-flight request without leaking state or money. None of that is in the standard React mental model, which assumes you have the data and are just deciding how to show it.

So the real question when you hire a React developer for an AI product is not "do they know React." It is "have they built a UI on top of something slow, partial, and unreliable, and did it stay calm when the underlying thing misbehaved." That is a narrower pool. It overlaps with the broader AI engineering skill set, but on the frontend it shows up as a specific set of habits you can probe for directly.

Hire for failure modes, not framerate. The hard part of an AI frontend is the slow, partial, cancelable, sometimes-wrong output, not another pixel-perfect page.

Streaming UIs and SSE: the load-bearing skill

The single most diagnostic skill is streaming. When a model answers, you want the text to appear as it is generated, not after a ten-second blank wait, because perceived latency is most of the user experience in a chat product. That means the frontend consumes a stream, usually server-sent events, and appends tokens to the rendered output as they arrive. The official MDN guide to server-sent events documents the EventSource API and the stream format a candidate should recognize on sight.

A strong candidate will talk about the parts that are not in the happy path: what happens when the stream stalls halfway through a sentence, and how you cancel an in-flight stream with an AbortController when the user hits stop, so you stop billing for tokens nobody will read. They will know how to avoid re-rendering the entire thread on every token, which is the classic mistake that turns a smooth stream into a stuttering mess on a long conversation. On the React side, they should know how Suspense and streaming render patterns fit in, and where they do not.

A weak candidate treats the stream as a fetch that happens to be slow. They wait for the whole response, then render it, which throws away the reason to stream. The tell is whether they reach for cancelation and render isolation unprompted, because those are the scars you only get from shipping a real streaming UI.

Optimistic UI, chat, and agent interfaces

Once output streams, the next hard problem is honesty about state. Chat interfaces lean heavily on optimistic UI: you show the user's message immediately, show a pending assistant turn, and reconcile when the real data lands. Done well, it feels instant. Done badly, the UI confidently shows something that turns out to be wrong, and the user loses trust in the product the first time the optimistic state and the real state disagree.

Agent interfaces raise the bar again. An agent does not just answer; it plans, calls tools, waits, and sometimes fails partway through a multi-step run. The frontend has to render that as a legible sequence the user can follow: this step ran, this tool was called, this one is waiting, this one errored and is retrying. If you are building on top of agentic workflows, the interface is where all that hidden machinery either becomes trustworthy or becomes confusing.

A React developer who has built this will talk about reconciliation and error states before you ask. They will tell you how they roll back an optimistic update when a tool call fails, how they show a partial agent run without pretending it finished, and how they keep the thread stable when a later message corrects an earlier one. That is the difference between a chat UI that feels solid and one that feels like it is lying to you.

State, performance, and the p95 you actually feel

Streaming and agents put unusual pressure on state and rendering. A chat thread is append-heavy and long-lived, tool results arrive out of order, and the whole thing updates many times per second while tokens stream. A React developer for AI products needs a clear point of view on state management under those conditions, whether that is a server-state library like TanStack Query for the request lifecycle, a lightweight store for thread state, or plain React state used carefully. The specific tool matters less than whether they can keep updates localized so the app stays responsive.

Performance here is not about a benchmark number; it is about the latency a user feels at the 95th percentile, on a real conversation, on a mid-range laptop. The average frame is fine; it is the long thread, the slow network, and the burst of tokens that expose a sloppy implementation. I treat p95 responsiveness as a product metric, the same way I treat it on the backend, because a stuttering stream reads as a broken product no matter how good the model is underneath.

Ask a candidate how they would profile a chat UI that gets sluggish after a hundred messages. The strong answer is specific: isolate what re-renders, memoize the message list, virtualize if needed, move stream parsing off the render path. The weak answer throws a bigger state library at it or blames React.

The signals, the test, and strong-versus-weak answers

Here is how I turn all of that into an interview you can actually run. For each signal there is a concrete test and a clear contrast between a strong and a weak response. None of these require a take-home that eats a candidate's weekend; the streaming and cancel tests are a thirty-minute live exercise.

Signal	How to test it	Strong answer	Weak answer
Token streaming	Render a fake SSE stream into a chat bubble, live	Appends incrementally, isolates re-renders, handles a stalled stream	Waits for the full response, then renders it once
Cancelation	"User hits stop mid-stream. What happens?"	AbortController tears down the request; partial state is kept or cleared deliberately	"It just finishes" or leaks an orphaned request
Optimistic UI	"The tool call you optimistically showed fails. Now what?"	Rolls back cleanly, surfaces the error, keeps the thread coherent	Leaves the wrong state on screen or wipes the whole thread
Agent step UX	Sketch a multi-step agent run with one failing step	Legible per-step status: running, done, waiting, errored, retrying	One spinner for the whole run, no per-step detail
Render perf	"Chat gets laggy at 100 messages. Debug it."	Profiles first, isolates re-renders, memoizes or virtualizes	Swaps the state library or blames the framework

Where to find and vet a React developer

Sourcing depends on what you are optimizing for. Job boards and your own network get you generic React talent quickly and cheaply. For AI-frontend experience specifically, the better signal is portfolio: ask to see a streaming or chat interface they shipped, ideally live, and have them walk you through the cancel and error paths. A staffing partner that has already built AI product frontends compresses the search, which matters when the in-house pool that has done this is thin; I cover that trade-off in the hiring guide.

The vetting that actually predicts on-the-job performance is the live streaming exercise above, not a LeetCode round. I once watched a team hire a developer with a beautiful portfolio of marketing sites for a chat product; he was genuinely good, but he had never consumed a stream in his life. The first sprint produced a chat UI that waited for the entire model response before showing anything, which made a fast model feel slow and tanked the demo. The skill was real; it was just the wrong skill, and a thirty-minute streaming test would have caught it before the offer.

The contrast case is just as instructive. A developer I rate highly failed half the trivia in a screen but, handed a flaky fake stream, immediately reached for an AbortController, isolated the re-render, and asked what we wanted to happen on a mid-stream cancel before writing a line. That instinct, designing for the failure path first, is the thing I am buying. Trivia you can look up; that instinct you cannot.

A generic React screen will happily pass the wrong person. Test the streaming and cancel paths, or you are hiring on a demo that has nothing to do with your product.

What it costs to hire a React developer

"React developer" spans a very wide price band, and the AI-frontend specialization sits at the top of it. The ranges below are illustrative for orientation, not quotes; they move with region, seniority, and whether you hire full-time, contract, or through a partner, so treat them as a way to sanity-check a bid rather than a price list.

Generic mid-level React (contract): a broad, competitive band globally. Fine for CRUD, dashboards, and marketing surfaces.
Senior React with real product depth: a meaningful step up, justified when the frontend is the product, not a thin shell.
Senior React with shipped AI-frontend experience: a premium on top of that, because the streaming, cancelation, and agent-UX scars are scarce and directly de-risk your build.

The honest framing is that you pay the AI-frontend premium only where your product needs it. If your AI feature is one chat widget bolted onto an otherwise standard app, a strong generalist who can learn the streaming patterns may be the better value. If the streaming chat or agent interface is the core experience, paying for someone who has already shipped it is cheaper than paying a generalist to learn on your users. For the broader build-versus-buy math across an AI team, see my breakdown of what AI engineers cost.

Mistakes hirers make

The most common and most expensive mistake is screening for generic React and assuming AI-frontend skill comes free with it; it does not. A developer can be excellent at everything you tested and still ship a chat UI that does not stream, because you never tested streaming. The fix is cheap: add the live stream-and-cancel exercise to every loop for an AI product role.

The second mistake is hiring for framerate and polish over failure handling. A candidate who builds a gorgeous static interface is impressive in a demo and can still be the wrong hire, because the demo never exercises a stalled stream, a failed tool call, or a mid-run cancel. Those are where AI frontends actually break, and they are invisible on a happy-path showcase.

The third mistake is over-indexing on a single framework or library. The state tool, the streaming library, the component kit; these are learnable in days by a developer who understands the underlying problem. Hire for the understanding of slow, partial, unreliable output, and let the specific stack be a detail, because a developer who reasons clearly about cancelation and reconciliation will pick up your libraries far faster than a library expert will pick up that reasoning.

Frequently asked questions

What should I look for when I hire a React developer for an AI product?

Look for fluency with slow, partial, cancelable model output, not just standard React. The load-bearing skills are token streaming over server-sent events, clean cancelation with an AbortController, optimistic UI that reconciles honestly, legible agent-step interfaces, and render performance that holds up on a long, fast-updating chat thread. Generic React competence is necessary but not sufficient.

How is hiring a React developer for AI different from a normal React hire?

A normal React hire renders known data and updates it on events. An AI-product hire renders output that arrives token by token, may be wrong or get corrected mid-stream, can be canceled, and depends on tool calls that fail. The mental model is different, so the interview has to test the streaming and failure paths directly rather than assuming they transfer from CRUD experience.

How do I test a React developer's streaming skills in an interview?

Run a thirty-minute live exercise: hand them a fake server-sent-events stream and ask them to render it into a chat bubble, then ask what happens when the user hits stop mid-stream and when a tool call they optimistically showed fails. A strong candidate appends incrementally, isolates re-renders, reaches for an AbortController, and rolls back failed state cleanly. A weak one waits for the full response and renders it once.

How much does it cost to hire a React developer in 2026?

It spans a wide band. Generic mid-level contract React is broadly competitive, senior product-grade React is a meaningful step up, and senior React with shipped AI-frontend experience carries a premium because those streaming and agent-UX skills are scarce. Pay the premium only where the AI frontend is the core experience; for a chat widget on a standard app, a strong generalist who learns the streaming patterns is often better value.

If you want a team that has already shipped streaming chat and agent interfaces and can start on the failure paths instead of learning them on your users, hire a React developer through Devlyn. For the wider hiring picture, my guide to hiring AI engineers and my book Building an AI-Native Team walk through how the frontend role fits the rest of the org. Hire for the failure modes. The happy path takes care of itself.

How to Hire a Node Developer for AI Products

Alpesh Nakrani — Sun, 05 Apr 2026 18:30:00 GMT

Hire a Node developer who can build AI-product backends: streaming APIs, agent orchestration, and tool servers under real load, not just a generic CRUD API.

If you are building an AI product, you should hire a Node developer who can hold a streaming connection open, orchestrate an agent that calls tools and sometimes fails, and keep all of it stable under real concurrency, not just a developer who can build a clean REST API over a database. That is the whole decision in one sentence. The market is deep in strong generic Node engineers, and most of them have never streamed a model's answer back to a client token by token, cancelled a half-finished agent run without leaking a request, or run a tool server that an LLM calls dozens of times a second. Those are the skills that break an AI backend in production, and they are exactly the skills a generic Node interview will never test.

I have sat in both seats. I came up as an engineer and now run conversion and product as a CRO, so I have written the streaming handler and signed off on the hire who could not. The gap between a Node developer who looks great on a CRUD demo and one who can hold a streaming, agent-driven backend together under load is the most expensive gap I see hiring managers miss. This piece is how I screen for it, what it costs, and where it goes wrong.

This is a supporting read under my broader guide to hiring AI engineers. If you have already decided you need a senior Node developer for an AI-native backend and want a team that has shipped this work, that is what the Devlyn Node team does. The rest of this is how to make a good call whether you hire through us or anyone else.

Key takeaways

Hire for streaming and failure, not CRUD. The skill that matters in an AI backend is holding slow, partial, cancelable, sometimes-failing work together, not shipping another tidy REST endpoint over a table.
Streaming is the load-bearing skill. A Node developer for AI products must be fluent in streaming responses, backpressure, cancelation, and connections that stay open for the length of a generation.
Agent orchestration is a state and retry problem. Multi-step agent loops, tool calls that fail, and long-running work need idempotency and queues, not a single request handler that prays.
Tool and MCP servers are a security surface. When a model can call your code, input validation and least privilege stop being optional; they are the design.
A generic Node screen will pass the wrong person. Test the streaming, cancel, and tool-call paths directly, or you are hiring on a demo that has nothing to do with your product.

Generic Node versus a Node developer for AI backends

Almost every Node developer you interview will be competent at the thing Node became popular for: accept a request, talk to a database or a few services, shape a JSON response, send it, and do it fast under concurrency. For a SaaS API, an integration layer, or an internal service, that is most of the job, the talent pool is deep and reasonably priced, and you should hire on the normal signals without overpaying for AI experience you will never use.

An AI product backend is a different animal. The work does not complete in fifty milliseconds; a single model call can run for many seconds, and you usually want to stream the result back as it is produced rather than make the user wait for the whole thing. Requests are long-lived, not fire-and-forget. An agent may call three tools, have one fail, and need to retry or recover without corrupting state. The client can cancel, and your backend has to stop in-flight work and stop paying for tokens nobody will read. None of that is in the standard CRUD mental model, which assumes a request arrives, does a bounded amount of work, and returns.

So the real question when you hire a Node developer for an AI product is not "do they know Node." It is "have they built a backend on top of something slow, partial, and unreliable, and did it stay calm when the underlying model or tool misbehaved." That is a narrower pool. It overlaps with the broader AI engineering skill set, but on the backend it shows up as a specific set of habits you can probe for directly. It is also the exact mirror of what I look for on the frontend when I hire a React developer for AI products; the streaming problem just lands on both sides of the wire.

Hire for streaming and failure, not CRUD. The hard part of an AI backend is the slow, partial, cancelable, sometimes-failing work, not another tidy endpoint over a table.

Streaming APIs and SSE: the load-bearing backend skill

The single most diagnostic skill is streaming. When a model answers, you want tokens to flow back to the client as they are generated, not after a ten-second blank wait, because perceived latency is most of the user experience in an AI product. On the backend that means holding a connection open and writing chunks as they arrive, usually as server-sent events. The official MDN guide to server-sent events documents the stream format and the EventSource contract a candidate should recognize on sight, and the producing side of that contract is Node.

A strong candidate will talk about the parts that are not in the happy path. They will know that Node's streams API exists for exactly this, with readable, writable, and transform streams, and that backpressure is the thing that keeps a fast producer from drowning a slow consumer. They will reach for an AbortController to tear down a generation when the client disconnects, so you stop billing for an answer no one is reading. They will know how to pipe a model's token stream through a transform without buffering the entire response in memory, which is the classic mistake that turns streaming into "wait, then dump."

A weak candidate treats the model call as a slow function that returns a string. They await the whole completion, then send it, which throws away the entire reason to stream and leaves your connection idle and your user staring at a spinner. The tell is whether they reach for backpressure and cancelation unprompted, because those are the scars you only get from shipping a real streaming backend.

Agent orchestration and long-running work

Once output streams, the next hard problem is orchestrating work that does not finish in one shot. An agent does not just answer; it plans, calls a tool, waits, reads the result, decides the next step, and sometimes fails partway through. That loop can run for thirty seconds or thirty minutes. A Node developer for AI backends has to model that as durable, resumable state, not as one request handler holding everything in local variables and hoping the process does not restart.

The habits that matter here are the ones that survive partial failure. Idempotency, so a retried step does not double-charge or double-send. Queues and background workers, so a long agent run is not pinned to a single HTTP request that times out. Clear state transitions, so a run that died at step four resumes at step four rather than from the top. This is where Node's concurrency model matters; the event loop makes it cheap to hold thousands of in-flight, mostly-waiting operations, which is exactly the shape of agent work, but it also means one accidental synchronous blocking call stalls every concurrent run at once.

If you are building on top of agentic workflows, the backend is where all that hidden machinery either becomes reliable or becomes a source of mystery bugs. A strong candidate talks about retries, idempotency keys, and queue semantics before you ask. A weak one describes a single async function with a try/catch around the whole thing and no story for what happens when the process dies mid-run.

Tool servers and MCP: when the model calls your code

The newest part of this job is building the servers that models call. When an agent uses a tool, something on your side has to expose that tool, validate the arguments the model produced, run it, and return a result the model can use. Increasingly that surface is standardized through the Model Context Protocol, which gives AI applications a consistent way to connect to tools, data, and workflows, and Node is one of the most common runtimes for building those servers.

Here is the part most generic backend developers miss: a tool server is a security surface, because the caller is a model, and the model's arguments are effectively untrusted input shaped by whatever ended up in its context. A Node developer for AI backends has to validate every tool argument as if it came from the open internet, scope each tool to least privilege so a "read a file" tool cannot write or delete, and design for the model occasionally calling a tool with confidently wrong arguments. Treating model output as trusted is how you end up with an agent that drops a table because someone prompt-injected it.

A strong candidate raises validation and least privilege without being prompted, and can talk about idempotency on tool calls because a model may retry the same call. A weak candidate builds the tool server like an internal admin endpoint, assumes the arguments are well-formed, and is surprised the first time the model sends something that should never have been possible.

The signals, the test, and strong-versus-weak answers

Signal	How to test it	Strong answer	Weak answer
Token streaming	Stream a fake slow model response back to a client over SSE, live	Writes chunks as they arrive, respects backpressure, never buffers the whole answer	Awaits the full completion, then sends it once
Cancelation	"The client disconnects mid-generation. What happens?"	AbortController tears down the model call; stops token spend deliberately	"It just finishes" or leaks an orphaned request
Agent orchestration	"A three-step agent run fails on step two. Now what?"	Idempotent retry from the failed step, durable state, no double side effects	Restarts the whole run or leaves state half-written
Tool server safety	"A model calls your delete tool with odd arguments. Design it."	Validates args, least privilege, idempotent, refuses malformed input	Trusts the model's arguments and runs the operation
Concurrency	"One endpoint blocks the event loop under load. Debug it."	Finds the synchronous call, moves CPU work off the loop, measures p95	Adds more instances or blames Node for being single-threaded

Where to find and vet a Node developer

Sourcing depends on what you are optimizing for. Job boards, marketplaces, and your own network get you generic Node talent quickly and reasonably cheaply. For AI-backend experience specifically, the better signal is portfolio: ask to see a streaming endpoint or an agent runner they shipped, ideally running, and have them walk you through the cancel, retry, and failure paths. A staffing partner that has already built AI product backends compresses the search, which matters when the in-house pool that has done this work is thin; I cover that trade-off in the hiring guide.

The vetting that actually predicts on-the-job performance is the live streaming and orchestration exercise above, not a LeetCode round. I once watched a team hire a developer with a spotless record of high-throughput REST APIs for an AI chat product. He was genuinely strong, but he had never streamed a response in his life. The first sprint produced an endpoint that waited for the entire model completion before sending a byte, which made a fast model feel slow and buffered long answers until memory spiked under concurrent users. The skill was real; it was just the wrong skill, and a thirty-minute streaming test would have caught it before the offer.

The contrast case is just as instructive. A developer I rate highly fumbled half the algorithm trivia in a screen but, handed a flaky fake model stream, immediately reached for backpressure, wired up an AbortController on client disconnect, and asked what we wanted to happen to billing on a mid-stream cancel before writing a line. That instinct, designing for the failure path first, is the thing I am buying. There was a second tell I trusted just as much: when I described a tool the agent would call, his first question was how to validate the arguments, not how fast it had to be. Trivia you can look up; that instinct you cannot.

A generic Node screen will happily pass the wrong person. Test the streaming, cancel, and tool-call paths, or you are hiring on a demo that has nothing to do with your product.

What it costs to hire a Node developer

"Node developer" spans a very wide price band, and the AI-backend specialization sits at the top of it. The ranges below are illustrative for orientation, not quotes; they move with region, seniority, and whether you hire full-time, contract, or through a partner, so treat them as a way to sanity-check a bid rather than a price list.

Generic mid-level Node (contract): a broad, competitive band globally. Fine for REST APIs, integrations, and standard SaaS backends.
Senior Node with real production depth: a meaningful step up, justified when the backend carries real concurrency, reliability, and on-call weight.
Senior Node with shipped AI-backend experience: a premium on top of that, because the streaming, orchestration, and tool-server scars are scarce and directly de-risk your build.

The honest framing is that you pay the AI-backend premium only where your product needs it. If your AI feature is one streaming endpoint bolted onto an otherwise standard API, a strong generalist who can learn the streaming patterns may be the better value. If the streaming, agent orchestration, and tool servers are the core of the product, paying for someone who has already shipped them is cheaper than paying a generalist to learn on your users. For the broader build-versus-buy math across an AI team, see my breakdown of what AI engineers cost.

Mistakes hirers make

The most common and most expensive mistake is screening for generic Node and assuming AI-backend skill comes free with it; it does not. A developer can ace everything you tested and still ship an endpoint that buffers instead of streams, because you never tested streaming. The fix is cheap: add the live stream-and-cancel exercise to every loop for an AI product role.

The second mistake is ignoring failure and concurrency in favor of clean happy-path code. A candidate who writes beautiful request handlers can still be the wrong hire, because the demo never exercises a stalled generation, a failed tool call, a mid-run cancel, or a thousand concurrent long-lived connections. Those are where AI backends actually break, and they are invisible on a happy-path showcase.

The third mistake is over-indexing on a single framework or library. The HTTP framework, the queue, the model SDK; these are learnable in days by a developer who understands the underlying problem. Hire for the understanding of slow, partial, unreliable work and untrusted model output, and let the specific stack be a detail, because a developer who reasons clearly about backpressure, idempotency, and validation will pick up your libraries far faster than a library expert will pick up that reasoning.

Frequently asked questions

What should I look for when I hire a Node developer for an AI product?

Look for fluency with slow, partial, cancelable model work, not just standard Node. The load-bearing skills are streaming responses over server-sent events with backpressure, clean cancelation when the client disconnects, durable agent orchestration with idempotent retries, and safe tool or MCP servers that validate untrusted model arguments and run with least privilege. Generic Node competence is necessary but not sufficient.

How is hiring a Node developer for AI different from a normal backend hire?

A normal Node hire accepts a request, does a bounded amount of work, and returns a response quickly. An AI-product hire holds long-lived streaming connections open, orchestrates multi-step agent runs that can fail partway, exposes tools a model calls with sometimes-wrong arguments, and has to stay stable under high concurrency. The mental model is different, so the interview has to test the streaming, failure, and tool-call paths directly rather than assuming they transfer from CRUD experience.

How do I test a Node developer's streaming skills in an interview?

Run a thirty-minute live exercise: give them a fake model that emits tokens slowly and ask them to stream the response to a client over server-sent events, then ask what happens when the client disconnects mid-generation and when a tool call the agent made fails. A strong candidate writes chunks as they arrive, respects backpressure, reaches for an AbortController to stop token spend, and designs idempotent retries. A weak one awaits the full completion and sends it once.

How much does it cost to hire a Node developer in 2026?

It spans a wide band. Generic mid-level contract Node is broadly competitive, senior production-grade Node is a meaningful step up, and senior Node with shipped AI-backend experience carries a premium because the streaming, orchestration, and tool-server skills are scarce. Pay the premium only where the AI backend is the core of the product; for a single streaming endpoint on a standard API, a strong generalist who learns the patterns is often better value.

If you want a team that has already shipped streaming APIs, agent orchestration, and tool servers and can start on the failure paths instead of learning them on your users, hire a Node developer through Devlyn. For the wider hiring picture, my guide to hiring AI engineers and my book Building an AI-Native Team walk through how the backend role fits the rest of the org. Hire for the failure modes. The happy path takes care of itself.

How to Hire a Full-Stack AI Developer (Without Guessing)

Alpesh Nakrani — Sat, 04 Apr 2026 18:30:00 GMT

Hire a full-stack AI developer who owns the AI feature end to end: frontend AI UX, model integration, and the eval loop, not a generic full-stack dev who has never shipped against a model.

If you are building an AI product and you want one person to move it forward, hire a full-stack AI developer who can own the whole AI feature end to end: the frontend AI UX, the backend model integration, and the eval loop that keeps it honest. That is the decision in one sentence. The market is full of strong full-stack engineers, and most of them have shipped clean CRUD apps, dashboards, and auth flows without ever once wiring a UI to a streaming model, handling an answer that arrives wrong, or building the evals that tell you whether the feature is getting better or worse. Those last three things are what break AI products in production, and a generic full-stack interview will never test for any of them.

I have sat in both seats. I came up as an engineer and now run conversion and product as a CRO, so I have written the model-integration code and I have signed off on the hire who could not. The most expensive mistake I see hiring managers make is treating "full stack" plus "has used the OpenAI SDK once" as if it equals a person who can carry an AI feature alone; it does not. This piece is how I separate the two, what the role actually owns, when one generalist is genuinely enough, and what it costs.

This is a supporting read under my broader guide to hiring AI engineers. If you have already decided you need someone to own an AI product slice end to end and want a team that has shipped this work, that is exactly what the Devlyn full-stack AI engineering team does. Everything below is how to make a good call, whether you hire through us or anyone else.

Key takeaways

A full-stack AI developer owns three layers, not two. Frontend AI UX, backend model integration, and the eval loop. Most "full stack" candidates have only ever shipped the first two against deterministic systems.
The eval loop is the load-bearing skill. Anyone can call a model. The full-stack AI engineer is the one who can prove the feature is good enough to ship and tell you when it stops being good.
One generalist beats a team early, and only early. Below a certain volume and risk line, a single end-to-end owner moves faster than three specialists with handoffs. Above it, the line flips.
A generic full-stack screen will pass the wrong person. Test the streaming path, the failure path, and the eval path explicitly, or you are hiring on a demo that has nothing to do with your product.
Rate follows the scarce skill. "Full-stack AI developer" spans a wide price band; you are paying for production AI judgment, not another React-plus-Node resume.

What a full-stack AI developer actually owns

The phrase "full stack" used to mean frontend plus backend: someone who could build the React app and the API behind it. A full-stack AI developer owns a third layer that did not exist in the old definition, and that layer is where the job actually lives. They own frontend AI UX, backend model integration, and the evaluation loop, and they own the seams between all three.

On the frontend, AI UX is not a prettier form. It is rendering output that arrives token by token, holding a chat thread steady while the model is still thinking, canceling a half-finished agent run cleanly, and showing uncertainty honestly instead of pretending the model is sure. A full-stack AI developer who came up on static dashboards has never built a UI that has to stay coherent while the answer is still streaming in and might be wrong when it lands.

On the backend, model integration is the part everyone assumes is easy because the SDK call is three lines. The hard part is everything around it: prompt orchestration, retrieval over a vector store, retries and timeouts when the provider is slow, cost control, and the routing logic that decides which model handles which request. This is the work I described in my piece on shipping the eval loop that keeps a model honest, and it is the half of the stack that a frontend-leaning generalist tends to underestimate.

The third layer is the one that separates a real full-stack AI engineer from a competent app developer who has touched a model: they own the evals. They can define what "good enough" means for the feature, build a frozen test set sampled from real traffic, and report whether this week's change made the feature better or worse. Without that loop, you do not have an AI feature; you have a demo that works until a customer finds the edge case. The person who owns all three layers, and the seams where they fail, is the person you are actually trying to hire.

Anyone can call a model. The full-stack AI engineer is the one who can prove the feature is good enough to ship, and tell you the week it stops being good.

Full-stack AI developer vs specialist AI engineer

The honest trade is breadth against depth. A full-stack AI developer carries the whole feature but goes less deep on any one layer. A specialist, an AI-frontend React engineer or a retrieval engineer or an evals lead, goes deep on one layer and depends on others to cover the rest. Neither is better in the abstract; they are answers to different questions.

You want the full-stack AI developer when the work is a vertical slice: one feature that has to go from a streaming UI through model integration to an eval loop, owned by one person who does not lose a day to handoffs. You want a specialist when one layer is genuinely hard enough to be a full-time job, a retrieval pipeline over millions of documents, or a frontend with strict accessibility and real-time constraints, where shallow coverage would sink you.

The failure mode I see most is hiring a specialist and expecting full-stack ownership, or hiring a generalist and expecting specialist depth. A retrieval engineer who is brilliant on vector databases will build you a thin, brittle frontend if you make them own the whole feature; a strong generalist asked to scale a retrieval system to production load will ship something that works at demo scale and falls over at real scale. Match the shape of the hire to the shape of the work. For the broader role map, my breakdown of what an AI engineer actually is lays out where each specialist fits.

When one generalist beats a whole team

Early, one good full-stack AI developer beats a team, and it is not close. When you are still finding out whether the AI feature is worth building, the bottleneck is iteration speed, and every handoff between a frontend person, a backend person, and an evals person is a tax on iteration. One owner who can change the prompt, adjust the UI, and rerun the eval set in the same afternoon will out-learn a three-person team that has to coordinate to do the same thing.

Here is an illustrative shape I have seen more than once. A seed-stage team has a chat feature that works in the founder's demo and falls apart with real users, so they hire one full-stack AI developer instead of three specialists. Within a month that person has instrumented the failure cases, rebuilt the streaming UI so it stops thrashing, added a retrieval layer, and stood up a small eval set. The feature is shippable, and crucially, the company now knows what good looks like; three specialists would still have been in the kickoff meeting deciding who owns what.

The line flips on volume and risk. Once the feature carries real traffic, once a wrong answer has a real cost, once latency at the 95th percentile is a revenue number and not a vibe, the breadth that made the generalist fast becomes the thing that holds you back. Now you want depth on each layer, and the generalist becomes the person who owns the architecture and the seams while specialists go deep; they do not get fired, they get promoted into the role that keeps the specialists pointed at the same outcome. I make the longer version of this argument in my guide to hiring AI engineers, because getting the sequence right is most of the battle.

The skills and signals that actually matter

Screen for failure-mode handling, not framework bingo. The resume that lists React, Node, Python, LangChain, and a vector database tells you the person has heard of the tools. It tells you nothing about whether they can hold an AI feature together when the model is slow, partial, and occasionally wrong. The skills that matter are the ones that only show up under those conditions.

The signals I weight most heavily: can they explain how they would render a streaming response without the UI jumping around; can they describe a time the model returned garbage and what their code did about it; can they tell me how they decided their AI feature was good enough to ship, in numbers, not vibes; and can they reason about cost per request without me prompting it. A candidate who lights up on all four has shipped real AI product work. A candidate who can only talk about the happy path has built a demo. For the full inventory, my list of the skills an AI engineer needs goes deeper than I can here.

One more signal that is easy to miss: judgment about when not to use a model. The strongest full-stack AI developers I have worked with will tell you, unprompted, which parts of the feature should be a plain database query or a rule, and which genuinely need a model. The weak ones reach for the model everywhere because it is the shiny tool. The role is full-stack AI development, not full-stack AI maximalism, and that judgment is exactly what you are paying a premium for.

A signal-to-test table you can screen with

Here is the screen I actually use, compressed into one table. For each signal, there is a concrete test you can run in an interview or a paid trial, and what a strong answer looks like next to a weak one. Use it to replace the generic full-stack screen that will quietly pass the wrong person.

Signal	How to test it	Strong vs weak
Streaming UI	Ask them to render a token-by-token response and keep the thread stable	Strong: handles partial output, cancelation, and scroll without thrash. Weak: waits for the full response, then dumps it.
Failure handling	Inject a wrong or empty model response in a live exercise	Strong: detects, degrades gracefully, tells the user honestly. Weak: renders the garbage as if it were correct.
Eval ownership	Ask how they proved a past AI feature was good enough to ship	Strong: frozen set, real metrics, a threshold set before the run. Weak: "it seemed fine in testing."
Model integration	Walk through retries, timeouts, and routing on a slow provider	Strong: bounded retries, fallback, cost-aware routing. Weak: one synchronous call, no timeout.
Cost judgment	Ask the cost per request of their last AI feature	Strong: knows the number and what drives it. Weak: has never looked.
Restraint	Ask which parts of a feature should not use a model	Strong: names the rule-or-query parts unprompted. Weak: uses a model for everything.

Where to find and vet a full-stack AI developer

The pool is smaller than the job postings suggest, because most people who claim the title have shipped two of the three layers. You will find strong candidates among application engineers who moved into AI product work, among ex-founders who built an AI product solo and therefore had no choice but to own the whole stack, and among the specialists who have deliberately broadened. Marketplaces and agencies can shortcut sourcing, but they do not replace your own vetting bar; a marketplace badge tells you someone passed a generic screen, not that they can carry your feature.

Vet with work, not with trivia. The single most reliable signal is a short paid trial on a real slice of your problem: give them a streaming feature with a deliberately flaky model behind it and watch what they build. The person who instruments the failure cases and asks how you will measure success is the hire; the person who ships the happy path and calls it done is the one your generic screen would have hired. According to the Stack Overflow 2024 Developer Survey, 76% of developers are using or planning to use AI tools, up from 70% the year before, so "I have used an AI tool" is now table stakes and tells you almost nothing.

Make them prove the production skills instead. If you want a structured version of this, my guide to building the eval loop doubles as a vetting rubric: a candidate who already thinks this way is the one you want.

What a full-stack AI developer costs

The honest answer is that the rate band is wide and it tracks the scarce skill, not the title. The application-layer AI engineer role is genuinely newer than the resume keywords suggest; as the Rise of the AI Engineer essay put it, you can be very effective in this role without ever training a model, which means the supply is people who learned to build on top of foundation models, and that supply is still catching up to demand. You are competing for them against every other company building an AI product.

Practically, expect to pay above a generic full-stack rate and below a research-ML rate. A strong full-stack AI developer who can own all three layers commands a premium over a standard full-stack engineer because the eval and integration skills are scarce, but less than a research scientist who trains models, because that is a different and rarer job you probably do not need. Where the money goes is judgment: you are paying for the person who will not burn your inference budget, will not ship an unmeasured feature, and will know when the slice is big enough to need a team. For the full breakdown of bands and contractor-versus-full-time math, see my piece on what an AI engineer costs, and the book Building an AI-Native Team covers how the role fits into the org as you scale past one hire.

The mistakes that cost the most

The most expensive mistake is hiring on a demo. A candidate shows you a slick AI app they built, and you read it as proof they can own production AI work. The demo proves they can build the happy path; it says nothing about whether they handled the wrong answer, the slow provider, or the eval loop, which is exactly where production AI features fail. Test the failure paths explicitly or the demo will sell you the wrong person.

The second mistake is title-shopping: hiring the resume with the most AI keywords instead of the person who can reason about failure modes, because the keywords are free to list. The third is hiring a generalist and then never letting them build the eval loop, because leadership treats evals as a nice-to-have. That is how you end up with a feature nobody can prove is working, which is a slower, more expensive version of having no feature at all. I have written the longer catalog of these in the hiring guide; the throughline is that every one of them comes from screening for the easy half of the job.

The demo proves they can build the happy path. The job is everything that happens when the happy path breaks.

Frequently asked questions

What does a full-stack AI developer actually do?

A full-stack AI developer owns an AI feature end to end across three layers: the frontend AI UX (streaming output, cancelation, honest handling of uncertainty), the backend model integration (prompt orchestration, retrieval, retries, cost control, routing), and the eval loop that proves the feature is good enough to ship and flags when it degrades. The third layer is what separates them from a normal full-stack engineer who has called a model once.

Should I hire a full-stack AI developer or a team of specialists?

Early, when the bottleneck is iteration speed and you are still learning whether the feature works, one full-stack AI developer beats a team because there are no handoffs. Once the feature carries real traffic and a wrong answer has a real cost, the line flips and you want depth on each layer, with your best generalist owning the architecture and the seams. Match the shape of the hire to the shape of the work.

What is the difference between a full-stack AI developer and a full-stack AI engineer?

In practice the titles are used interchangeably; both describe one person who owns the AI feature across frontend, backend, and evals. If a job description draws a line, "engineer" sometimes signals deeper backend and systems ownership and "developer" sometimes signals a frontend lean, but the skills you should screen for are identical: streaming UX, robust model integration, and a real eval loop.

How do I vet a full-stack AI developer if I am not technical?

Run a short paid trial on a real slice of your problem with a deliberately flaky model behind it, and watch for three behaviors: do they instrument and handle the failure cases, do they ask how success will be measured, and do they reason about cost per request without being prompted. A candidate who does all three has shipped production AI work. A candidate who only delivers the happy path has built a demo.

If you would rather skip the search and hand the AI feature to a team that already owns all three layers in production, that is what Devlyn's full-stack AI engineering team does. And if you are building the org around this role rather than making a single hire, my book Building an AI-Native Team walks through how the full-stack AI developer fits next to the specialists as you scale. Hire for the half of the job that breaks. Screen for the rest.

How to Hire a DevOps Engineer for AI Workloads

Alpesh Nakrani — Fri, 03 Apr 2026 18:30:00 GMT

Hiring a DevOps engineer for AI is a GPU-cost and reliability bet, not a generic ops hire. Here is what the role owns, how to vet it, and what it costs.

When you hire a DevOps engineer for AI, you are hiring for one thing above all others: someone who can keep GPU-backed model workloads reliable and affordable in production. Not the longest cloud-certification list, not a resume full of CI/CD pipelines for stateless web apps. The person you want is the one who can take a model that serves correctly in a notebook and make it deploy, autoscale, observe, roll back, and stay cheap enough to run on hardware that costs more per hour than most of your other infrastructure combined. That instinct for GPU economics under production load is the scarce thing, and it is what separates a strong AI DevOps hire from an expensive generalist who treats your inference bill as someone else's problem.

I have hired and deployed senior AI and infrastructure engineers at Devlyn, and I sit in two seats at once: I read the GPU utilization dashboards and I read the P&L. From that seat, the pattern is consistent. Most teams hiring their first DevOps engineer for AI screen for generic ops skills, anchor on the wrong cost, and only discover the mismatch when a model-serving cluster sits at 12% GPU utilization and the cloud bill triples. This piece is the specialist deep-dive that branches off my pillar guide to hiring AI engineers, and it is written for the person who has already decided they need this role and wants to get it right the first time.

If you would rather not run a three-month search for a role you cannot fully vet yourself, you can buy the capability pre-vetted. That is exactly what the Devlyn DevOps engineering team exists for: senior engineers who own GPU infrastructure, model serving, and inference cost, on a transparent rate, with a trial period instead of a hiring gamble. But whether you build or buy, you need to know what good looks like, so let me give you that first.

Hire for GPU economics, not cloud breadth. The scarce skill is keeping model workloads reliable and cheap on expensive hardware, not naming the most cloud services.
AI DevOps is generic DevOps plus the hard parts. GPU scheduling, model serving, and cost per token are where this role lives, and where a generic ops hire quietly fails.
The role is defined by the inference bill and the uptime curve. Autoscaling GPU workloads, observability on model behavior, and FinOps for inference are where it earns its salary.
The cost that matters is the loaded cost plus the GPU bill it controls. A US DevOps engineer for AI runs roughly $130K to $210K base, and a weak one can cost you many times that in wasted GPU spend.
Know whether you need DevOps, MLOps, or an AI engineer. The titles overlap, the pain points do not, and matching the specialist to the problem is the decision that pays back the most.

What a DevOps engineer for AI actually owns

A DevOps engineer for AI owns the path your model workloads take from a trained or selected model to a reliable, cost-controlled production service, and everything that keeps that service healthy on GPU hardware. That is the whole job in one sentence, and the words carrying the weight are GPU and cost-controlled. A generic DevOps engineer can ship a stateless web service that scales horizontally on cheap CPU instances. The AI DevOps engineer is the person who makes sure a model serves at the latency your product needs, on hardware that costs ten to forty times more per hour, without that hardware sitting idle and burning money.

Concretely, the surface they own breaks into five areas. First, GPU infrastructure: provisioning and scheduling GPU nodes, bin-packing models onto them, managing driver and CUDA compatibility, and squeezing utilization up so you are not paying for silicon that sits idle. Second, model serving: getting a model behind an inference server like vLLM, Triton, or TGI, tuning batch sizes and concurrency, and managing the cold-start and warm-pool problem that makes GPU autoscaling genuinely hard.

Third, CI/CD for models: this is not CI/CD for code with a different label on it. Model artifacts are large, deployments are stateful, and a rollback means swapping multi-gigabyte weights without dropping in-flight requests. Fourth, cost and FinOps for inference: a DevOps engineer for AI who does not watch cost per token and GPU utilization will hand you a system that works and a bill that does not, which is why I treat inference cost as a first-class concern for this role, not an afterthought.

Fifth, observability: instrumenting not just CPU and memory but GPU utilization, token throughput, queue depth, and the latency tail that users actually feel. A model-serving system can be green on every standard ops dashboard and still be failing the product, because the metrics that matter for AI workloads are not the metrics generic monitoring ships with by default.

A model-serving cluster can be green on every standard ops dashboard and still be failing the product, because the metrics that matter for GPU workloads are not the ones generic monitoring ships with.

If you want the standard menu of tools attached to these areas, it looks like Kubernetes with the GPU operator underneath most of it, Terraform for provisioning, vLLM or Triton for serving, Prometheus and Grafana for metrics, and a cost tool layered on top. But here is the thing I tell every founder: the tools are the answer to the wrong question. The right question is whether the person can own the outcome, a reliable service at a defensible cost per token, when the tool inevitably does not do what the docs promised.

AI DevOps vs generic DevOps vs MLOps

This is the disambiguation that saves the most confusion, because the three titles overlap and the market uses them loosely. A generic DevOps engineer owns the deployment and reliability of conventional software: CI/CD pipelines, infrastructure as code, autoscaling stateless services, and incident response, almost all of it on CPU. A DevOps engineer for AI owns that same surface but for GPU-backed model workloads, where the hard problems are GPU scheduling, model serving, and inference cost rather than horizontal scaling of cheap instances.

An MLOps engineer sits adjacent and leans toward the model lifecycle: training and data pipelines, experiment tracking, model registries, drift detection, and the reproducibility of "which data and code produced this model." There is real overlap with AI DevOps on serving and monitoring, which is why teams conflate them, but the center of gravity differs. MLOps cares most about the model staying correct over time; AI DevOps cares most about the infrastructure under it staying reliable and cheap. I cover the model-lifecycle side in detail in how to hire an MLOps engineer, and the honest truth is that at small scale one strong person often covers both.

The practical rule: if your pain is "our model is wrong or drifting," that is MLOps. If your pain is "our model is right but our GPU bill is insane and the serving layer keeps falling over," that is DevOps for AI. If your pain is "we cannot get a good enough model at all," you need an AI or ML engineer, and the skills that actually matter there are different again. Hiring the wrong specialist for your actual pain is the most expensive mistake in this whole space.

The skills and signals that separate a strong hire from a weak one

The strongest DevOps engineers for AI I have worked with share a trait that does not appear on any certification: they think in GPU dollars. Ask one how they would deploy a new model and a weak candidate describes a generic Kubernetes rollout; a strong one immediately starts asking what the request pattern looks like, whether you can batch, what GPU type fits the model, and what utilization you are getting today. That instinct to reason about the most expensive resource first is the single best predictor of a hire who will save you money rather than cost it.

The second signal is whether they understand the model-serving layer rather than treating it as a black box behind a load balancer. Anyone can put a service behind an ingress. The engineer you want knows why continuous batching changes your throughput, why cold starts on GPU autoscaling are a real product problem, and how to keep a warm pool sized so you are neither paying for idle GPUs nor dropping requests during a spike. This is genuinely harder than CPU autoscaling, and a candidate who has never felt that pain will not see it coming.

The third signal is failure-mode thinking applied to cost. Ask a strong candidate what happens when traffic doubles overnight, and they will talk about both reliability and the bill, because for GPU workloads those are the same conversation. A weak candidate optimizes for uptime alone and hands you a system that never falls over and quietly costs three times what it should. The discipline I want is the same one I describe in the gap between offline and online evaluation: a system that passed every check in staging can still fail, on cost or on latency, once real traffic hits it.

A screening table you can run in an interview

Signal	Test	Strong	Weak
GPU cost instinct	"This model serves fine but costs $30K a month on GPU. What do you do?"	Asks for utilization first, then proposes batching, quantization, right-sizing, or routing, ties each to the bill	Suggests a bigger instance or treats cost as finance's problem
Serving depth	"Walk me through standing up a model behind an inference server."	Talks batch size, concurrency, warm pools, and cold-start handling without prompting	Describes a generic deploy behind a load balancer; no GPU specifics
Autoscaling under spikes	"Traffic doubles in five minutes. What happens to your GPU fleet?"	Reasons about warm capacity, scale-up latency, and the cost of headroom vs dropped requests	Assumes GPU nodes scale as fast as CPU pods
Observability for AI	"Everything is green on the dashboard but users say it is slow. Find it."	Goes to GPU utilization, queue depth, token throughput, and the p95 latency tail	Re-checks CPU and memory and is stuck when they look fine
Scope honesty	"Where does your lane end and MLOps or the AI engineer begin?"	Draws a clear line and names an honest gap	Claims to own infra, model lifecycle, and modeling all at once

None of these tests requires a take-home or a whiteboard algorithm. They require the candidate to reason out loud about GPU-backed production, which is the only environment that matters for this role. If you cannot run these tests confidently yourself because you do not have an infrastructure background, that is a signal in itself, and we will come back to what to do about it.

Where to find and vet a DevOps engineer for AI

The sourcing channels are the usual ones: your network first, then specialist communities, then platforms. The engineers worth hiring tend to cluster around the open-source serving and infrastructure tools they actually use, the vLLM and Triton communities, Kubernetes GPU operators, people active in inference-optimization and FinOps-for-AI circles. Job boards and general recruiters will send you volume; the volume will be heavy on generic DevOps resumes and light on the GPU-cost thinkers you actually want.

The real problem is not finding candidates. It is vetting them. DevOps for AI sits at the intersection of infrastructure engineering, GPU economics, and model serving, which means a generalist interviewer can be fooled in both directions, by a strong cloud engineer who has never run a GPU fleet, and by a model enthusiast who has never owned reliable infrastructure. The screening table above is your defense, but it only works if someone on your side can tell a real answer from a confident one.

This is where most first-time hirers get burned, and it is the honest case for buying the capability pre-vetted rather than building it cold. If you cannot evaluate the candidate yourself, you are gambling on a three-month search for a role whose failure modes you cannot see, and the cost of getting it wrong shows up as a GPU bill, not a missed sprint. Buying pre-vetted capacity through a dedicated DevOps engineer for AI moves the vetting risk off your plate and onto a team that runs this rubric for a living. I make the full build-versus-buy argument in the pillar guide; for AI infrastructure specifically, the asymmetry is sharper because idle GPUs cost real money every hour you get it wrong.

What a DevOps engineer for AI costs in 2026

Let me give you the salary line first, because it is the number everyone anchors on, and then explain why it is the wrong number to anchor on. In the US in 2026, general DevOps engineer base salaries run roughly $81K at entry level to $175K and up for senior roles (kore1). DevOps engineers with genuine AI and GPU-infrastructure depth sit at the top of that band and overlap with MLOps comp, which runs roughly $90K to $257K depending on seniority and market (kore1). Call it a $130K to $210K base for the role as most US teams scope it, with senior specialists at top employers climbing higher in total compensation. Offshore and nearshore, the same capacity costs meaningfully less on the rate card.

But the salary line is the smallest part of the true cost, and this is the same lesson I lay out in detail on what an AI engineer actually costs. Add benefits, taxes, equipment, and tooling and a $180K base becomes a loaded cost well north of $230K before the person has saved a single GPU-hour. Then add ramp: a new DevOps engineer for AI needs to learn your stack, your models, and your traffic patterns before they can safely tune the serving layer, and that is weeks to months at partial capacity.

The cost that actually matters is the one nobody quotes you: the GPU bill this person controls. A strong hire who pushes a serving cluster from 15% to 60% GPU utilization can save more in a quarter than their annual salary; a weak hire who leaves it at 15% costs you that delta every single month, silently, as a line item nobody questions. Optimize for cost per reliably-served request, not cost per hour, because the cheapest hour and the cheapest outcome are almost never the same person.

Three ways these hires fail (and how to avoid them)

I will keep these illustrative and NDA-safe, but the patterns are real and I have watched each of them play out more than once.

The generic-ops transplant. A team hired a strong DevOps engineer from a web-app background to run their model-serving infrastructure. He stood up Kubernetes beautifully and the service never went down, but he deployed each model to its own GPU node with no batching and no bin-packing, because that is how he had always run stateless services, and the cluster sat at single-digit utilization. The bill was four times what it needed to be, and nobody questioned it because uptime was perfect. The fix was not more reliability; it was the GPU-economics instinct the interview never tested for.

The cold-start surprise. A team set up GPU autoscaling to save money during quiet hours, copying a pattern that works fine for CPU workloads. When traffic spiked, new GPU nodes took minutes to come up and load multi-gigabyte weights, and users hit timeouts during exactly the moments that mattered most. The engineer had assumed GPU nodes scale as fast as CPU pods. A warm pool and a realistic scale-up budget would have caught it; the assumption that AI infrastructure behaves like web infrastructure did not.

The blind-spot dashboard. A team had every standard ops metric instrumented, CPU, memory, request count, error rate, all green, yet users complained the product felt slow anyway. The serving layer was queueing requests behind saturated GPUs, and queue depth and token throughput were never on the dashboard because the monitoring was built for a generic web service, so the latency tail stayed invisible until a customer surfaced it. Observability for AI workloads has to be designed for GPUs and tokens, not inherited from CPU-era defaults.

Each of these is avoidable with the screening rubric above and an honest read on whether you have the in-house ability to vet. When you do not, the lower-risk move is to engage a team that has already absorbed these lessons. That is the argument for working with a pre-vetted DevOps engineer for AI rather than running the gauntlet yourself, especially for your first hire in this function.

Frequently asked questions

What does a DevOps engineer for AI do, in one sentence?

A DevOps engineer for AI owns the infrastructure that takes model workloads from a trained or selected model to a reliable, cost-controlled production service on GPU hardware: provisioning and scheduling GPUs, model serving, CI/CD for model artifacts, inference cost and FinOps, and observability tuned for utilization and the latency tail. The defining concern is keeping expensive hardware reliable and cheap at the same time.

How is a DevOps engineer for AI different from a generic DevOps engineer?

A generic DevOps engineer is excellent at deploying and scaling conventional software, almost all of it on cheap CPU. A DevOps engineer for AI owns that same reliability surface but for GPU-backed model serving, where the hard problems are GPU scheduling, batching, cold starts, and cost per token rather than horizontal scaling of stateless instances. A strong generic engineer can grow into the role, but the GPU-economics instinct is what you are actually hiring for.

Do I need a DevOps engineer for AI or an MLOps engineer?

If your pain is that the model is wrong, drifting, or hard to reproduce, you need MLOps. If your pain is that the model is correct but the serving layer keeps failing or the GPU bill is out of control, you need DevOps for AI. The two overlap on serving and monitoring, and at small scale one strong person often covers both before you split into specialists as volume grows.

How much does it cost to hire a DevOps engineer for AI in 2026?

In the US, expect roughly a $130K to $210K base for the role as most teams scope it, overlapping the senior end of general DevOps and the AI-adjacent infrastructure band, with senior specialists higher in total comp. But the loaded cost, including benefits, ramp, and the risk of a bad hire, is far higher than the salary line, and the GPU bill this person controls dwarfs their comp, so budget for the outcome, not the rate card.

If you want the full picture on building the team around this hire, my book Building an AI-Native Team covers the role mix end to end, and the pillar guide to hiring AI engineers connects it to the rest of the cluster. And if you would rather have GPU infrastructure and model serving owned from day one without the hiring risk, that is exactly what Devlyn's DevOps engineers for AI are for. Hire for the GPU bill and the bad day. The good day takes care of itself.

How to Hire a Data Engineer (the AI Foundation)

Alpesh Nakrani — Thu, 02 Apr 2026 18:30:00 GMT

How and where to hire a data engineer for AI, the skills and signals to screen for, what it costs, and when to hire through a partner instead of building in-house.

To hire a data engineer who actually moves your AI roadmap forward, screen for someone who treats pipeline reliability, data quality, and retrieval-ready data as the job, not the plumbing nobody wants to own, and source them through specialist networks or a partner that pre-vets for production experience rather than a general job board. If you cannot vet the candidate yourself, the fastest path is to hire through a partner who can put a pre-vetted senior data engineer in front of you in days, instead of the months an open-market search for this role currently takes.

I have sat on both sides of this table. I started as an engineer, and I now run revenue at Devlyn, where I hire and deploy data engineers into AI products that touch paying customers. So I will skip the recruiter platitudes and tell you what separates a data engineer who builds the foundation your AI system stands on from one who hands you a brittle pipeline that breaks the first time the source schema shifts. This is the data-specialist deep dive under my broader guide to hiring AI engineers.

Key takeaway: No AI works on bad data. A data engineer is the foundation hire, not the afterthought. Screen for pipeline reliability, data-quality instinct, and retrieval-ready thinking, not tool bingo on a resume.
Data engineering for AI is not the same job as classic ETL. Retrieval-ready data needs source context, permissions, freshness, and quality evidence that record-moving pipelines never had to carry.
The interview must contain real, dirty data. If your loop is a SQL puzzle and a culture chat, you are screening for the wrong job. Hand them a pipeline that silently drops rows and watch whether they catch it.
Cost tracks scarcity, not hype. Data engineers run roughly $126K base and $150K total comp in the US, and the wrong hire costs far more than the right salary.
The build-vs-partner decision hinges on one question: can you vet this person yourself? If you cannot, hiring through a pre-vetting partner is faster and cheaper than a wrong full-time hire.

What a data engineer actually owns for AI

A data engineer builds and runs the pipelines that get clean, fresh, trustworthy data to everything downstream, including every AI system you will ever ship. That is the whole job, and it is the job most teams discover they needed only after their AI project stalls. The model was never the bottleneck. The data feeding it was.

I have watched this pattern enough times to call it a law: no AI works on bad data. You can swap models, tune prompts, and rebuild your retrieval stack all you want, but if the pipeline is feeding stale records, duplicated rows, or documents nobody has permission to surface, the system will fail in ways that look like a model problem and are actually a data problem. The data engineer owns the layer everything else stands on, which is exactly why it is the foundation hire for any serious AI effort.

For AI specifically, the role goes beyond classic data warehousing. Retrieval-ready data is data that carries its source, its permissions, its freshness, and evidence of its quality, so a retrieval system can return the right chunk to the right user without leaking anything it should not. That is ingestion, parsing, chunking, metadata, embeddings, lineage, and refresh workflows, not just moving records from one table to another. A data engineer who has only built reporting pipelines is not automatically ready for this; the AI surface raises the bar on freshness and governance in ways a dashboard never did.

Then there is the flywheel, which is the reason data engineering compounds. Better data produces a better model, a better model drives more usage, more usage produces more data, and the loop tightens. That flywheel only spins if someone owns the data layer well enough to keep the inputs clean as volume grows. Hire a weak data engineer and the flywheel spins the wrong way: bad data trains a worse model, the worse model loses usage, and the loop runs down. The data engineer is the person who decides which direction it turns.

No AI works on bad data. The model was never the bottleneck. The data feeding it was.

Data engineer vs ML engineer vs MLOps engineer

These three roles get conflated constantly, and hiring the wrong one is how teams end up with a person who is genuinely good at a job that is not the one they have. Let me draw the lines cleanly.

A data engineer owns the data layer: pipelines, ingestion, transformation, quality, and the governed datasets that everything downstream depends on. They make data dependable and retrieval-ready. An ML engineer owns the modeling layer: features, training, validation, and the drift monitoring that catches a model rotting in production. An MLOps engineer owns the operational layer: deployment, serving infrastructure, CI/CD for models, and the reliability of the system at runtime.

The simplest way to keep them straight is by the question each one answers. The data engineer answers "can we trust the data?" The ML engineer answers "is the model correct?" The MLOps engineer answers "will it stay up and fast under load?" Those are different jobs with different failure modes, and a strong candidate for one is often only adequate at the others.

For a first AI hire, the order usually matters more than teams expect. If your data is a mess, an ML engineer will spend their first quarter doing data engineering badly, and an MLOps engineer will have nothing reliable to deploy. The data engineer is frequently the highest-leverage first hire precisely because the other two roles are blocked without one. I lay out the full role taxonomy in the skills that actually separate the good ones.

The skills and signals to screen for

The skill that predicts success in this role better than any other is pipeline reliability thinking. A strong data engineer assumes every source will eventually send malformed data, every upstream schema will eventually change without warning, and every job will eventually fail at 3am, and they build for those facts from the start. If a candidate talks about pipelines as if they run cleanly forever, they have not yet operated one that did not.

The second signal is data-quality instinct. Ask how they would know a pipeline is silently corrupting data, and a strong one will not say "we'd check the logs." They will talk about data contracts, freshness checks, row-count and distribution monitoring, null-rate alerts, and reconciliation against a source of truth. That instinct is the difference between someone who notices the numbers drifted before finance does and someone who finds out when a stakeholder does.

The third signal is retrieval and ingestion literacy, which is where AI data engineering departs from the classic role. A candidate ready for AI work can talk through document ingestion, chunking strategy, metadata that survives retrieval, embedding refresh, and access control that travels with the data. If they have only built batch pipelines into a warehouse, they can learn this, but you need to know that going in rather than discovering the gap after the RAG system returns documents the user was never allowed to see.

The fourth signal is governance and lineage discipline, and the fifth is simply production scar tissue. Lineage means they can tell you where any number came from and what it depends on, which matters enormously the day a downstream metric looks wrong. Scar tissue means they have owned a pipeline through a real incident, watched a silent failure cost the business, and built the monitoring that would have caught it. Production changes how someone thinks, because production is where you learn the pipeline is never finished, only monitored.

A signal-by-signal screening table you can run

Signal	What to test	Strong vs weak
Pipeline reliability	"A nightly job fails silently for a week. How would you have known on day one?"	Strong: freshness checks, row-count and distribution alerts, idempotent retries. Weak: "we'd see it in the logs eventually."
Data-quality instinct	Hand them a dataset that quietly drops 3% of rows; ask if they trust it	Strong: reconciles against source, checks null rates and distributions, finds the leak. Weak: runs the query and reports the number.
Retrieval-ready thinking	"How would you prep these documents so a RAG system returns the right chunk to the right user?"	Strong: chunking, metadata, freshness, access control that travels with the data. Weak: "load them into a vector DB."
Schema-change handling	"An upstream API changes a field type overnight. What happens to your pipeline?"	Strong: data contracts, schema validation, fail-loud over fail-silent. Weak: "it would break and we'd fix it."
Lineage and governance	"A finance number looks wrong. Trace it back."	Strong: documented lineage, can name every transform and source. Weak: "I'd dig through the SQL."
Production scar tissue	"Tell me about a pipeline that broke in a way that cost the business"	Strong: a specific silent failure and the monitoring that fixed it. Weak: only greenfield or tutorial stories.

The pattern across every row is the same. A strong data engineer assumes the data is wrong until proven otherwise and builds the system to catch its own failures; a weak one assumes the data is fine until someone complains. You are hiring for the first kind.

Where to find data engineers (and how to vet them)

The supply problem is real, so where you look matters. The strongest data engineers are rarely scanning general job boards; they are employed, building, and reachable through specialist communities, open-source contributions to data tooling like dbt, Airflow, Dagster, and the ingestion ecosystem, technical writing, and referrals from people who have shipped pipelines alongside them. A candidate who has published an honest post-mortem on a pipeline that quietly corrupted data is worth ten who list "data engineering" as a skill.

Wherever you source them, the vetting bar is the same, and it is not a SQL trivia round. Algorithmic puzzles tell you nothing about whether someone can spot a silent data-quality failure or design a pipeline that fails loudly instead of quietly. The single highest-signal screen is a small, paid take-home built around realistic, dirty data: here is a pipeline with a subtle leak and a source that occasionally sends malformed records, build something you would actually deploy and tell me what you do not trust about it. How they reason through that beats any whiteboard round.

I once watched a team nearly pass on a quiet candidate who fumbled the SQL-optimization trivia, then ace the take-home by refusing to sign off on a pipeline until she had found that a timezone bug was silently double-counting events at the day boundary. They hired her. She turned out to be the best data engineer on the team, precisely because her instinct was to distrust the numbers before she trusted them. The trivia round would have screened her out; the data-shaped exercise screened her in. The details are changed, but the lesson is not.

The mirror-image story is the candidate who dazzled in the interview, named every tool in the modern data stack, and shipped a pipeline that looked clean until an upstream schema change three weeks later started feeding nulls into the model's most important feature, and nobody noticed for a month because there was no monitoring. Both are composites. Both point the same direction: vet for the discipline around the data, not the vocabulary around the tools.

What it costs to hire a data engineer

Compensation for this role is meaningful because the talent is genuinely scarce, not because of hype. As of 2026, the average data engineer in the US earns around $126K base and roughly $150K in total compensation, with senior engineers averaging about $143K, per the Built In salary data. At the top end, data engineers at large tech companies with AI-platform experience run well past that once stock is included. Those numbers price the gap between someone who can write a query and someone who can build a data foundation an AI system can stand on. I break the full picture down in what an AI engineer costs.

The scarcity behind those numbers is structural and durable. The data roles that feed this discipline are among the fastest-growing in the economy: the Bureau of Labor Statistics projects data-scientist employment to grow about 36 percent over the 2023 to 2033 decade, far outpacing the average occupation (R&D World, citing BLS). Demand that outruns supply by that margin is exactly why time-to-hire on the open market stretches into months for a strong data engineer.

The cost that gets ignored is the cost of getting it wrong. A failed senior technical hire is commonly estimated at 1.5x to 3x annual salary once you count ramp time, severance, the opportunity cost of the unbuilt roadmap, and the rehire. For a $150K data role, that is a $225K to $450K mistake, and it is far more likely when you cannot evaluate the person you are hiring. The expensive part of hiring is not the salary; it is the wrong salary attached to the wrong person, feeding bad data into everything downstream.

In-house vs hiring through a partner

The build-vs-partner decision is not about cost first; it is about your ability to vet and the time you have. Hiring a full-time data engineer into your own org is the right move when the data layer is core and recurring, when you can credibly evaluate the candidate, and when you can afford to wait months to fill the seat. If all three are true, hire in-house and own the capability.

The case for hiring through a partner gets strong the moment one of those conditions fails. If you cannot confidently vet a data engineer yourself, you are making a six-figure bet on a skill set you cannot assess, and a partner who has already done the vetting absorbs that risk. If you need someone shipping in weeks rather than months, a pre-vetting partner skips the multi-month open-market search. And if the work is real but not yet a permanent headcount, an embedded specialist lets you move now without committing to a hire you might not need in a year.

This is the gap Devlyn is built to close. If you would rather not run a multi-month search and a vetting loop you are not equipped to run, Devlyn can put a pre-vetted senior data engineer in front of you in days, screened for exactly the signals in the table above: pipeline reliability, data-quality instinct, retrieval-ready thinking, schema-change handling, and lineage discipline. You keep the option to convert to full-time once you have seen the work, which is a far safer way to make a senior hire than a resume and three interviews. If your need is the broader data foundation rather than one seat, Devlyn's AI data engineering work covers the full layer, from ingestion to governed, retrieval-ready datasets.

The honest version of this advice is that a partner is not always the answer. If the data platform is your core product surface for the next five years and you have the judgment to hire well, building the team yourself is the better long-term play, and my book Building an AI-Native Team is about exactly that. The partner route wins on speed, vetting risk, and optionality, which is what most teams making their first data hire are short on.

The mistakes that sink a data hire

The mistake I see most often is treating the data engineer as plumbing and hiring the cheapest person who knows the tools. Data engineering is the foundation your AI system stands on, and a weak foundation does not announce itself; it shows up months later as a model that quietly degraded because its inputs rotted. Start from the question "what data must this system never get wrong, and how would we know?" and hire the person whose instincts are organized around answering it.

The second mistake is an interview loop with no real data in it. If your process is two SQL rounds and a behavioral chat, you have measured query skill and culture and learned nothing about whether this person can build a pipeline you can trust. The interview has to contain the actual job, which means a messy dataset to reconcile or a silently failing pipeline to diagnose, scored on reasoning rather than a clean answer.

The third mistake is hiring a classic data engineer for an AI job without checking for the AI-specific gap. A brilliant warehouse-and-reporting engineer may have never built retrieval-ready data, handled embedding refresh, or carried access control through to a retrieval layer. That gap is learnable, but only if you know it exists before you hand them the AI roadmap. The freshness and governance bar for AI data is higher than the bar for a dashboard, and the day a model evaluation surfaces a hallucination traced back to a stale document is the wrong day to discover the gap.

The fourth mistake is ignoring the operational half of the role. A pipeline is not a deliverable; it is a system that needs monitoring, ownership, and on-call attention long after the launch. Hire someone who has lived through a pipeline failing silently, because they will build the freshness checks and quality alerts from day one instead of discovering they were needed after the bad data already trained a worse model.

Frequently asked questions

How do I hire a data engineer if I cannot evaluate the skills myself?

Hire through a partner that pre-vets for production data experience, or bring in a trusted senior practitioner to run your technical screen. Making a six-figure bet on a skill set you cannot assess is the single most expensive way to hire, and a pre-vetting partner exists precisely to absorb that risk. You can convert a strong embedded engineer to full-time once you have seen real work, which beats hiring on a resume and three interviews.

What is the difference between a data engineer and an ML engineer?

A data engineer owns the data layer: pipelines, ingestion, quality, and the governed, retrieval-ready datasets everything downstream depends on, and answers "can we trust the data?" An ML engineer owns the modeling layer: features, training, validation, and drift monitoring, and answers "is the model correct?" For a team whose AI project is stalling on messy or stale data, the data engineer is usually the higher-leverage first hire, because the ML engineer is blocked without one.

How much does it cost to hire a data engineer?

In the US as of 2026, the average data engineer earns around $126K base and roughly $150K total compensation, with senior engineers averaging near $143K, and AI-platform specialists at large tech companies running higher once stock is counted. Embedded or partner engagements trade a monthly rate for speed and lower vetting risk. The bigger number to watch is the cost of a wrong hire, commonly 1.5x to 3x salary once you count ramp, opportunity cost, and rehire.

What is the single best screening signal for a data engineer?

Whether they distrust the data until it proves itself. The strongest data engineers assume every source will eventually send bad records and every pipeline will eventually fail silently, so they build freshness checks, distribution alerts, and data contracts that fail loudly before a stakeholder notices. A take-home around realistic, slightly-broken data surfaces that instinct faster than any whiteboard round.

If you want the broader hiring playbook this fits inside, start with my guide to hiring AI engineers and the team-design thinking in Building an AI-Native Team. And if you would rather skip the multi-month search and the vetting loop you are not equipped to run, Devlyn can put a pre-vetted senior data engineer in front of you in days, screened for the pipeline and data-quality discipline that actually predicts a foundation worth building on. No AI works on bad data. Hire the person who owns that.

How to Hire a Forward Deployed Engineer

Alpesh Nakrani — Wed, 01 Apr 2026 18:30:00 GMT

A forward deployed engineer embeds with your customer and turns an unclear AI business case into a shipped solution. Here is when you need one, how to vet, and what it costs.

If you want to hire a forward deployed engineer, you are usually trying to solve one specific problem: you have a customer with an AI use case that nobody has defined cleanly, and you need one person who can sit next to that customer, figure out what actually moves their numbers, and ship working software against it. A forward deployed engineer is a customer-embedded engineer who turns an unclear business case into a deployed solution. That is the whole job. They are not a backend engineer you point at a ticket, and they are not a sales engineer who runs demos. They live in the gap between what the customer thinks they want and what will actually work in production, and they close that gap with code.

I have spent the last few years embedding engineers with customers at Devlyn, watching some of those placements turn into the most valuable hires the customer ever made and watching others quietly fail in the first month. The difference is almost never raw engineering talent. It is whether the person can tolerate ambiguity, earn a stranger's trust in a week, and ship something imperfect on Friday that the customer can react to on Monday. This piece is the hirer's guide to the role: what an FDE owns, how it differs from the adjacent titles you already know, when you actually need one, the signals that predict success, where to find them, what they cost, and the mistakes that waste the hire.

If you have already decided you need this role and you would rather not spend three months building a hiring loop from scratch, you can hire a forward deployed engineer through Devlyn and have someone embedded with your customer in weeks. If you are still working out whether this is the right role at all, keep reading.

An FDE owns the outcome, not a ticket. The job is to turn an unclear AI business case into a shipped, working solution sitting next to the customer, not to implement a spec someone else wrote.
Ambiguity tolerance is the load-bearing trait. Most strong engineers who fail in this role fail because they need a clean spec before they can move, and the spec never arrives.
It is not a sales engineer or a consultant. An FDE writes production code and ships it; a solutions engineer sells and demos; a consultant advises and leaves a deck.
Vet with a real ambiguous problem, not a LeetCode loop. The work sample should hand them a vague goal and a messy stakeholder, then watch how they narrow it.
You need one when the deal is real but the requirements are not. Clear specs do not need an FDE; pilots that have to become production do.

What a forward deployed engineer actually is, and owns

The term comes from Palantir, which built its entire delivery model around engineers embedded inside customer organizations rather than sitting back at headquarters waiting for requirements. Palantir popularized the title, and the pattern spread because it solved a problem that traditional software delivery could not: in complex, messy environments, the requirements only become clear once an engineer is in the room watching the real work happen (Wikipedia). The same source notes that comparable functions exist under other names at OpenAI, Google, and AWS, which tells you the role is converging into a recognized category rather than staying a Palantir quirk.

What an FDE owns is the path from "the customer has a vague AI ambition" to "there is software in production doing the thing." That means they own discovery, where they sit with the customer and figure out what problem is actually worth solving. They own the build, writing real production code against the vendor platform or stack. They own the deployment, including the unglamorous parts: the SSO integration, the data access permissions, the stakeholder who needs to sign off. And they own the relationship enough that when something breaks at 9pm, the customer calls them and not a support queue.

AI is why this role spiked. A traditional software integration has a knowable shape; an AI deployment does not, because the customer rarely knows whether their data supports the outcome they want until someone builds an evaluation harness and measures it. Reported demand for the role has climbed steeply across AI-native companies over the last year, and while the exact multiples cited in recruiting writeups vary and should be treated as directional rather than precise, the direction is not in doubt. The role is rising fast because AI deployments are uncertain by nature, and uncertainty is exactly what this role is built to absorb.

An FDE lives in the gap between what the customer thinks they want and what will actually work in production, and they close that gap with code.

FDE vs solutions engineer vs consultant

This is the distinction that trips up most hiring managers, because the titles overlap on the surface and diverge completely in practice. Get it wrong and you will hire the wrong person for a real need, or set the right person up to fail by judging them against the wrong scorecard.

A solutions engineer, sometimes called a sales engineer, sits on the revenue side. Their job is to win the deal: scope demos, answer technical objections, design the architecture the customer will buy. They are excellent at the pre-sale technical conversation and they hand off after the contract signs. A solutions engineer who is forced to own a six-week build in a customer's codebase is usually miserable and slow, because that was never the job they optimized for.

A consultant advises. They come in, assess, produce recommendations, and leave a document behind. Good consultants are genuinely valuable for strategy, but the deliverable is the advice, not the running system. When the engagement ends, the code, if there is any, is rarely production-grade and rarely maintained.

A forward deployed engineer ships. They are post-sale, they write production code, and they stay until the thing works in the customer's environment with real users. They borrow the customer empathy of a solutions engineer and the problem-framing of a consultant, but the output is software that survives contact with production. If you remember one thing: a solutions engineer sells it, a consultant explains it, an FDE ships it.

When you actually need a forward deployed engineer

You need this role when the deal is real but the requirements are not. If a customer has signed or is close to signing, the use case is high value, and yet nobody can write you a clean spec because the problem is genuinely fuzzy, that is the FDE signal. The classic shape is a pilot that has to become production. Pilots are forgiving; production is not, and the gap between them is where most enterprise AI projects quietly die.

An FDE is the person who walks that gap on purpose. They are comfortable starting before the destination is clear, and they treat the customer's environment as the spec it never wrote down.

You do not need an FDE when the spec is clean. If you can hand an engineer a well-defined ticket and a stable interface, a normal product or platform engineer is cheaper and a better fit. Forcing an FDE-shaped person onto clean, well-scoped work tends to bore them and waste the premium you paid for their ambiguity tolerance. Matching the role to the situation matters as much here as in any hire, which is the same logic I lay out in when to hire an AI engineer.

You also need to be honest about whether you want this capability in-house or through a partner. Building an FDE practice internally means recruiting a rare profile, building a vetting loop for it, and carrying the cost between engagements. Renting it through a partner gets you a vetted person embedded quickly without the standing overhead. I walk through that tradeoff in depth in in-house vs outsourced AI; the short version is that the partner route usually wins until you have enough sustained customer-embedded demand to keep an internal bench busy.

The skills and signals that matter

Three traits do almost all the predictive work for this role, and only one of them is technical. The technical bar is real but it is a floor, not the differentiator. The candidate needs to ship production AI software competently: RAG pipelines, evaluation harnesses, agent plumbing, the observability to know when the thing is misbehaving. You can probe most of that with the same approach I describe in the AI engineer skills guide.

Above that floor, three things separate a great FDE from a great engineer who will fail in the seat. Only one of them shows up on a resume, and it is not the one that decides the outcome.

Customer empathy. The FDE has to read a room full of nervous stakeholders, figure out who actually holds the budget and the veto, and translate vague business anxiety into a technical problem worth solving. An engineer who treats the customer as a source of requirements rather than a partner will get a clean-sounding spec and build the wrong thing perfectly.

Ship-fast bias. The instinct to put something imperfect in front of the customer this week, learn from their reaction, and iterate, beats the instinct to disappear for a month and return with a polished system the customer no longer wants. Shipping honestly and early is the discipline I argue for in building agents honestly; in the embedded context it is survival, because the customer's trust decays every week they cannot see progress.

Ambiguity tolerance. This is the one that matters most and the one interviews are worst at measuring. The FDE has to start moving before the problem is defined, hold several plausible directions in their head at once, and narrow toward the right one through action rather than waiting for clarity that will never arrive. Most strong engineers who fail in this role fail here. They are not weak; they are spec-dependent, and the spec is the one thing this job will never hand them.

Here is how I turn those traits into something you can actually test in an interview loop rather than guess at from a resume.

Signal	How to test it	Strong answer	Weak answer
Ambiguity tolerance	Hand them a vague goal with no spec and watch them start	Asks two sharp questions, then proposes a first slice to ship and learn from	Asks for a full spec before committing to anything
Customer empathy	Role-play a defensive stakeholder who keeps changing the ask	Finds the real fear under the ask, reframes the problem, builds trust	Argues with the stakeholder or silently builds the literal request
Ship-fast bias	Ask what they would put in front of the customer in week one	A thin, honest slice that proves or kills the core assumption	A multi-week plan with no customer-visible output until the end
Production judgment	Probe a past deployment for what broke and how they knew	Names the failure mode, the metric that caught it, the fix	"It worked in testing" with no production failure story
Ownership	Ask who they called when the thing broke after launch	"They called me, and here is how I handled it"	Hands off to a support queue and moves on

Where to find and vet a forward deployed engineer

The best sourcing pools are engineers who have already lived the role under another name. Look for people from delivery-heavy companies built on the embedded model, early engineers from startups who wore the customer-facing hat by necessity, and solutions architects who got tired of handing off before the real work started and learned to ship. The common thread is that they have already been alone in a customer's office with an undefined problem and figured it out.

The vetting mistake to avoid is running a standard algorithmic interview loop. A LeetCode gauntlet tells you whether someone can invert a binary tree under stress; it tells you nothing about whether they can sit with a confused customer and narrow a fuzzy goal into shippable work. The work sample that actually predicts the role is a deliberately under-specified problem: a vague business outcome, a messy fake stakeholder played by your team, and a short window, then watch how they narrow it.

Do they freeze waiting for requirements, or do they start cutting the problem down and propose something to ship? That single signal is worth more than the rest of the loop combined, and it is the spine of how I think about vetting AI engineers generally.

One illustrative example of what good looks like, NDA-safe and composed from the pattern rather than any one engagement: an embedded engineer is dropped into a customer who insists they need a fully autonomous agent to handle their entire support queue. Instead of building the moonshot, the engineer ships a narrow assistant that drafts replies for one ticket category and routes the rest to humans, in week one. The customer sees it working, trust forms, and the real scope emerges from watching the thin slice run.

The weak version of the same hire spends a month building the autonomous agent the customer asked for, demos it, and discovers the data never supported the ambition. Same talent, opposite outcome, and the difference is entirely in the approach to ambiguity.

What a forward deployed engineer costs

An FDE commands a premium over a comparable backend engineer because you are paying for a rarer combination: production engineering plus customer-facing judgment plus tolerance for ambiguity. In the US market, expect total compensation for a strong in-house FDE to land in the senior-to-staff engineer range, often with an additional component tied to the success of the deployments they own. The exact number depends on seniority and your market, and the broader breakdown in the AI engineer cost guide applies here, with the embedded premium layered on top.

The number that should actually drive the decision is not the salary; it is the cost of the alternative. A stalled enterprise pilot that never reaches production burns the deal, the customer relationship, and the months your team spent on it. Measured against that, the FDE premium is small.

The honest tradeoff is utilization: between engagements, an in-house FDE is expensive idle capacity, which is the main reason the partner model exists. You pay for the capability when you need it and avoid carrying the bench when you do not.

The number that should drive the decision is not the FDE salary. It is the cost of the enterprise pilot that never reaches production.

The mistakes that waste the hire

The most common mistake is hiring a pure backend engineer and expecting the customer-facing instincts to appear on the job. They rarely do. A brilliant engineer who needs a clean ticket and goes quiet in front of a stakeholder will produce excellent code against the wrong problem, and you will not find out until the deployment lands flat.

The mirror-image mistake is hiring a polished solutions engineer for the production-code half of the role. They will charm the customer and scope a beautiful demo, then struggle to own a real build in an unfamiliar codebase for six weeks. Both halves of the role are load-bearing, and a candidate who is strong on only one is a half-fit dressed up as a full one.

The third mistake is treating the role as sales in disguise and measuring it on pipeline. An FDE measured on demos and deal influence will optimize for looking good in the room instead of shipping software that survives production. Measure them on what they deploy and what it does for the customer's real numbers. The deeper pattern behind all three errors, hiring for a familiar shape instead of the actual job, is the one I keep returning to across the hiring AI engineers pillar, because it is the error that quietly wastes the most money in AI hiring.

If you would rather skip the trial and error and get someone vetted specifically for ambiguity tolerance and shipping discipline, Devlyn's forward deployed engineers are screened against exactly the signals in the table above, embedded with your customer instead of with us.

Frequently asked questions

What is a forward deployed engineer?

A forward deployed engineer is a customer-embedded engineer who sits inside a client's organization and turns an unclear business case into a shipped, production-grade solution. The role was popularized by Palantir and has spread across AI-native companies, where comparable functions also appear under titles like customer engineer or solutions architect (Wikipedia). They own discovery, the build, and the deployment, not just one ticket in a backlog.

What is the difference between a forward deployed engineer and a solutions engineer?

A solutions engineer sits on the sales side and wins the deal through demos and technical scoping, then hands off after the contract signs. A forward deployed engineer is post-sale and writes production code, staying embedded with the customer until the system works with real users. One sells the capability; the other ships it.

When should I hire a forward deployed engineer instead of a normal engineer?

Hire one when the deal is real but the requirements are not: a high-value customer, a fuzzy problem nobody can spec cleanly, and a pilot that has to become production. If you can hand an engineer a clean spec and a stable interface, a standard product engineer is cheaper and a better fit.

How do I vet a forward deployed engineer?

Skip the algorithmic interview and run a deliberately under-specified work sample. Hand them a vague business outcome and a messy stakeholder, give them a short window, and watch how they narrow it. The engineers who start cutting the problem down and propose something to ship in week one are the ones who will survive the seat; the ones who freeze waiting for a spec will not.

If you want the full system around this role, how an embedded engineer fits into a team that ships fast without breaking trust, my book Building an AI-Native Team walks through it. And when you are ready to put someone in front of your customer, you can hire a forward deployed engineer through Devlyn and have them embedded in weeks rather than quarters. Ship the thing. Then scale it.

How to Choose an AI Development Company

Alpesh Nakrani — Tue, 31 Mar 2026 18:30:00 GMT

I run an AI development company, so read me with that bias. Here is what good actually looks like, the questions that expose a slideware shop, and when to skip a vendor entirely.

The way to choose an AI development company is to test for the three things a slideware shop cannot fake: senior engineers who own the work, evals that prove the system survives production, and a contract that leaves the IP and the architecture in your hands. Good looks like a vendor who will scope a small paid pilot with written success criteria before asking for a year-long commitment, and who tells you what they will not promise before they tell you what they will. If a company leads with a logo wall and a model name instead of a failure mode and a number, you are looking at a reseller.

I should be upfront about my bias, because it shapes everything below. I run revenue at Devlyn, an AI development company. I am not a neutral observer of this market; I compete in it. So I have written this to be useful even if you never call us. The framework here is the one I would want a friend to use if they were vetting my own company, and I have tried to name the red flags that I know how to hide as well as the ones I do not.

I have also sat on the other side of this table. Before revenue I spent fourteen years building software and a decade as a CTO and COO, which means I have hired AI development companies, fired a few, and been the in-house team that an outside vendor was quietly competing against. That is the vantage point this article is written from: both seats, and an honest accounting of what each one sees that the other misses.

Key takeaway: Choose an AI development company on three things a reseller cannot fake: senior engineers who own the work, evals that predict production, and a contract that keeps your IP and architecture.
The demo is not the product. A clean demo proves the happy path works once. The real question is what happens on the messy 40% of inputs that never appear in the pitch.
A scoped paid pilot beats a year-long contract. Written, pre-agreed success criteria on a small bounded problem tell you more about a vendor than any case study about someone else's environment.
Engagement model is a risk decision, not just a price. Fixed bid, time and materials, dedicated pod, and staff augmentation each move different risks onto different parties. Pick the one that matches who owns the outcome.
Sometimes the answer is no vendor. If the AI capability is your core moat and your window is multi-year, build in-house and use a partner only to start the clock.

What an AI development company actually does

An AI development company builds and ships software where the hard part is a probabilistic system: a model, a retrieval pipeline, an agent, an evaluation harness, the cost and latency discipline around all of it. That is the distinction that matters. A traditional software shop ships deterministic code where the same input gives the same output. An AI development company ships systems where the output is a distribution, and most of the engineering work is making that distribution behave acceptably in production.

In practice the work falls into a few buckets. There is custom application development with AI features built in, where the model is one component of a larger product. There is the model and data layer itself: fine-tuning, retrieval, routing between a small model and a frontier one, and the evals that tell you whether any of it is good enough to ship. And there is the operational layer that almost nobody demos: observability, cost monitoring, the human-in-the-loop design for the cases the model gets wrong.

The reason the category exists as something distinct from a normal dev shop is that the failure modes are different. Deterministic software fails loudly; it throws an error or returns the wrong page. An AI system fails quietly. It retrieves perfectly in the demo and watches recall collapse in month three as the corpus grows and the queries drift. A real AI development company is the one that has lived through that collapse and built the discipline to catch it before you do.

What separates a real AI development company from a reseller or a slideware shop

The market is full of companies that have repackaged a frontier API and a prompt as an "AI platform." They are not lying, exactly; they are optimizing for the meeting. The gap between them and a real engineering company shows up in three places, and you can probe all three before you sign anything.

Senior engineers, not juniors hidden behind AI. The dominant pattern of the hype cycle was to staff large numbers of junior engineers, give them code generation tools, and present the output as senior-caliber AI-assisted work. The velocity looks impressive. The quality, over six months, is not. AI tooling amplifies whatever judgment it is attached to, so a senior engineer using it gets faster without getting less careful, and a junior using it produces volume that masks the absence of judgment. Ask who will be in the code on your account, by name and seniority, and ask to talk to them.

Evals, not vibes. A real AI development company treats evaluation the way a traditional team treats tests: not a QA layer bolted on at the end, but a continuous signal that the system is performing. The demo shows you the clean world. The eval shows you the messy one. If a vendor cannot describe how they would measure whether your system is good enough to ship, by failure mode and severity rather than a single headline accuracy number, they do not know whether their own work is good. I have written about why evals are the thing that predicts production, and it is the single most reliable tell I know for separating engineers from salespeople.

Ownership, not lock-in through obscurity. A reseller builds complexity it can explain only to itself, which creates dependency. A real partner codifies what works into a reference architecture your team can maintain and extend without them. This feels counterintuitive, like it reduces the vendor's future revenue, but it expands it: clients trust you with larger work when they know you are not hoarding knowledge to keep them captive.

If a company leads with a logo wall and a model name instead of a failure mode and a number, you are looking at a reseller.

How to evaluate an AI software development company: the questions to ask

Vetting an AI software development company is less about their answers and more about whether they flinch. A real shop welcomes hard questions because the questions are the same ones they ask themselves. A slideware shop redirects to a case study. Here are the questions I would put on the table, and what the good and bad answers sound like.

"How will we know this works in our environment, not your demo?" The good answer is a scoped pilot with written, measurable success criteria agreed before work starts: a number, a dataset, a deadline. The bad answer is a reference to a different client's results, which tells you about someone else's environment and nothing about yours.

"Who, specifically, writes the code, and can I meet them?" The good answer names senior engineers and puts them in the second conversation. The bad answer keeps you talking to a solutions engineer performing delivery while the actual staffing stays vague until after the contract is signed.

"What will you not promise?" This is the one that separates the honest vendor from the optimistic one. A company that has been doing this seriously will tell you the timelines it will not commit to and the outcomes it cannot guarantee. I have argued at length that in a market full of buyers who have been burned by AI, leading with constraints is a credibility signal, not a weakness. A vendor who promises everything is telling you they have not hit the wall yet.

The red flags cluster predictably. Be wary of a fixed price quoted before discovery, because it means the scope is fiction and the change orders are where the real margin lives. Be wary of a refusal to start small. Be wary of any vendor whose entire differentiation is the model they use, because the model is the cheapest, most replaceable part of the stack. And be wary of vague IP terms, which I will come back to, because that ambiguity is rarely an accident.

Engagement models and what they actually cost

How you contract is a risk decision dressed up as a pricing decision. Each model moves different risks onto different parties, and the right one depends on how well-defined the work is and who needs to own the outcome.

Fixed bid works only when the scope is genuinely knowable in advance, which AI work rarely is. The vendor absorbs scope risk, so they price in a buffer and fight every change. It is fine for a tightly bounded proof of concept and a trap for anything exploratory.

Time and materials moves the scope risk back to you and the delivery risk stays shared. It is honest for genuinely uncertain work, but it rewards hours rather than outcomes, so it only works if you trust the team and watch the burn.

A dedicated pod is a standing team that owns a product function and is measured on whether it works. This is the model I prefer, because it aligns the vendor with the outcome rather than the hours; it is also the one that requires the most trust on both sides. The honest cost frame from the in-house comparison applies here too, and I have written the full breakdown of what an AI engineer actually costs if you want the underlying numbers.

Staff augmentation rents you capacity that your own people direct. It is the cheapest to start and the easiest to misuse, because rented engineers with no ownership produce exactly what they are told and nothing more. I have laid out the trade-off between staff augmentation and consulting separately, since the choice between renting hands and buying outcomes is its own decision.

On cost shape, treat any single number with suspicion, including mine. A small standing AI capability, whether in-house or through a partner, tends to land somewhere in the range of several hundred thousand dollars a year, and the variance is enormous. The number that should worry you is not the headline rate; it is the change-order rate, the attrition rate on the vendor's side, and the cost of the work that gets thrown away because nobody defined success before building it.

When to use an AI development company vs hire in-house

The cleanest way to make this call is to ask one question: is this AI capability a thing you must own to win, or a thing you must have to operate? If it is your moat, the part of the product that competitors cannot easily copy, you should be building toward owning it in-house, because the compounding value of a team that lives inside your domain never shows up in a cost comparison. If it is a capability you need to operate but not to differentiate, a partner is almost always faster and cheaper.

Speed is the variable most teams underprice. A senior in-house AI hire typically takes four to six months to recruit and another three to six to ramp, so call it six to nine months before they ship something that matters. A standing partner team ships in weeks because it already exists. If your window is under a quarter, in-house is effectively off the table regardless of strategy, and a partner is the only way to start the clock. I have written the full decision framework for in-house versus outsourced AI development, and the related question of when to hire an AI engineer at all rather than wait.

The honest middle path is hybrid: a partner builds the first version and the reference architecture, your in-house team learns by owning it afterward, and the dependency dissolves on a schedule you control. That only works if you have a strong internal owner to hold the steering wheel. Without one, hybrid degrades into expensive outsourcing with extra meetings.

Is this AI capability a thing you must own to win, or a thing you must have to operate? The answer decides build versus buy before any cost comparison does.

It is worth saying plainly why this matters more in AI than in ordinary software. MIT's NANDA initiative reported in 2025 that roughly 95% of enterprise generative AI pilots failed to deliver measurable business return, a finding widely reported through late 2025. The cause was almost never the model. It was data readiness, workflow integration, and the absence of a defined outcome before the build started. Choosing the right partner, or the right in-house bet, is mostly about avoiding that 95%, and the avoidance is organizational, not technical.

The vetting checklist: criterion, green flag, red flag

Here is the checklist I would paste into a vendor evaluation. Score each criterion as you go; a real AI development company clears the green column on most of them, and any vendor sitting in the red column on staffing, evals, or IP is a walk-away regardless of how good the demo looked.

Criterion	Green flag	Red flag
Staffing	Named senior engineers on your account; you meet them early	Solutions engineer fronts it; real staffing vague until after signing
Evaluation	Evals by failure mode and severity, shared with you continuously	"It performed well in testing"; no measurement plan
Scope start	Small paid pilot with written, pre-agreed success criteria	Pushes a year-long contract before any bounded proof
IP and architecture	You own the code and IP; reference architecture handed over	Vague IP terms; lock-in through complexity only they understand
Honesty	Tells you what they will not promise, before what they will	Promises every timeline and outcome; no named trade-off
Differentiation	Process, evals, and judgment; model is an implementation detail	Entire pitch is the model name and a logo wall
Pricing logic	Engagement model matches who owns the outcome	Fixed bid quoted before discovery; change orders are the real margin

Two short stories, both NDA-safe and generalized from patterns I have seen rather than any single client. A founder once showed me a recommendation feature a vendor had built that demoed flawlessly. There were no evals. When real customer queries hit it, accuracy fell to a level the team only discovered from complaints, because nobody had built the harness to see it sooner. The fix was not a better model; it was the evaluation suite that should have existed on day one.

The other pattern is quieter and more expensive. A company hired a shop on a fixed bid, the scope shifted as they learned what they actually needed, and within two quarters the change orders had doubled the contract while the architecture had calcified into something only the vendor could touch. The lock-in was not in the contract language; it was in the complexity. Both stories are the same lesson from different angles: the thing that protects you is not the price, it is who owns the outcome and whether anyone is measuring it.

If you have read this far and want a partner who works exactly this way, with senior engineers, evals from day one, and ownership you keep, that is the company I run. You can see how Devlyn approaches custom software and AI development, and if you are earlier and want a blunt read on whether your AI bet is even real before you staff it, the AI strategy and readiness conversation is the place to start.

Frequently asked questions

What does an AI development company do?

An AI development company builds and ships software where the hard part is a probabilistic system: a model, a retrieval pipeline, an agent, and the evaluation and cost discipline around them. Unlike a traditional dev shop that ships deterministic code, its core skill is making an uncertain output behave acceptably in production, which is mostly evals, observability, and the design for the cases the model gets wrong.

How do I choose an AI development company?

Test for three things a reseller cannot fake: senior engineers who own the work, evals that measure whether the system survives production, and a contract that keeps your IP and architecture in your hands. Insist on a small paid pilot with written success criteria before any long commitment, and treat a vendor who tells you what they will not promise as more credible than one who promises everything.

How much does an AI software development company cost?

It varies enormously, and any single number deserves suspicion. A small standing AI capability tends to run in the range of several hundred thousand dollars a year, but the figure that should worry you is the change-order rate and the cost of work thrown away because success was never defined. The engagement model, fixed bid versus time and materials versus dedicated pod versus staff augmentation, moves more of your real cost than the headline rate does.

Should I hire an AI development company or build in-house?

Ask whether the AI capability is a thing you must own to win or a thing you must have to operate. If it is your moat, build toward owning it in-house, because the compounding value never appears in a cost comparison. If you need it to operate but not to differentiate, or your window is under a quarter, a partner is faster and cheaper; the hybrid path of partner-builds-then-you-own works well if you have a strong internal owner.

The deeper philosophy underneath all of this, why judgment became the scarce input and how that changes who you should trust to build, is the thing I have spent the most time on. The full playbook is in my guide to hiring AI engineers and at book length in Building an AI-Native Team: Hiring for judgment, not throughput. If you read one thing after this, read that, because choosing an AI development company well is downstream of getting the hiring philosophy right.

AI Consulting Services: What You Get and How to Choose

Alpesh Nakrani — Mon, 30 Mar 2026 18:30:00 GMT

Real AI consulting delivers a shipped, evaluated system, not a deck. Here is what it includes, what it costs, and how to pick a consultant without getting burned.

Real AI consulting services deliver one thing the slide decks never do: a working system in production, evaluated against your data, that someone is accountable for when it breaks. The way you choose one is brutally simple. Ask any consultant to show you something they shipped, the evals they ran on it, and the failure modes they caught before a customer did. The ones who can will talk about traces and edge cases. The ones who cannot will talk about transformation and roadmaps.

I should tell you my bias up front, because this whole piece is an argument against vendors who hide theirs. I run revenue at Devlyn, an AI-native engineering company, and we sell exactly the kind of delivery I am about to describe. So read this as a competitor's honest account of the category, not a neutral one. I will still tell you when not to hire a consultant at all, because the fastest way to lose a client's trust is to sell them something they did not need, and the second fastest is to pretend you have no skin in the game.

I sit in two seats at once. I read the model traces and I read the profit-and-loss statement. That combination is the only reason this article exists, because most writing about AI consulting is produced by people who have sat in exactly one of those seats and are quietly guessing about the other.

Key takeaway: Real AI consulting delivers a shipped, evaluated system you own, not a strategy artifact you file and forget.
Advice and delivery are different products. Pure advice is cheap to produce and easy to be wrong about; the firms worth paying are accountable for the thing actually working in production.
The red flags are loud if you listen. No live demo, no evals, no named engineer on the account, and pricing by the hour rather than the outcome all point the same direction.
Engagement model is a risk-allocation decision. Who absorbs scope risk, you or the consultant, is the real question behind fixed-bid versus time-and-materials.
Sometimes the right answer is do not hire one. If your problem is undefined or your data is not ready, a consultant cannot save you, and a good one will say so.

What AI consulting services actually deliver

Strip away the category language and AI consulting is a sequence of four things, in order, each of which is supposed to de-risk the next. The order matters more than the labels, because skipping a step is where most engagements quietly fail.

The first is readiness and prioritization. A good consultant looks at your data, your workflows, and your constraints, and tells you where AI should go first based on value, feasibility, and risk, not based on what is in the news. This is the part the strategy firms do well and stop at. The output is a prioritized backlog, not a maturity score, because a maturity score does not tell anyone what to build on Monday.

The second is a scoped pilot with success defined in writing before any work begins. Not "the system performs well" but something measurable, like extraction accuracy above ninety-two percent on a validation set you both agreed on, inside eight weeks. The pilot is not a sales trial. It is the period where both sides learn whether the engagement makes sense, and a consultant who refuses to commit to a number before starting is telling you they do not expect to hit one.

The third is delivery: senior engineers building the thing, integrating it with your systems, and handling the unglamorous edge cases that make up most of the real work. The fourth is evaluation, built in from day one rather than bolted on at the end, so you can see that the model is still performing after it ships and not just on demo day. If a consulting engagement gives you the first step and calls it a service, you bought a report. If it gives you all four, you bought a capability.

If you want the deeper version of how that delivery team should be built and held accountable, I wrote a whole book on it: Building an AI-Native Team walks through the roles, cadences, and evidence loops that keep machine output honest.

Advice versus slideware, and why most buyers cannot tell until it is too late

Here is the distinction that the entire category tries to blur. Advice is a genuine product, and good advice is worth real money. Slideware is advice dressed up to look like delivery, sold at delivery prices, with none of the accountability that delivery carries. The two are almost impossible to tell apart in a sales meeting, which is exactly why the meeting is designed the way it is.

The reason the confusion persists is structural, not malicious. The person who builds the strategy deck is optimizing for the meeting. The person who would own the implementation is usually not in the room. The gap between those two realities is where the money disappears, and the buyer almost never sees it until three months in, when the roadmap turns out to assume data they do not have and integrations nobody scoped.

The numbers around this are not subtle. Gartner has predicted that at least thirty percent of generative AI projects would be abandoned after proof of concept by the end of 2025, citing poor data quality, weak risk controls, escalating cost, and unclear business value (reported via THE Journal). MIT's Project NANDA went further in its 2025 report "The GenAI Divide," finding that roughly ninety-five percent of enterprise generative AI pilots delivered no measurable return, and concluding the failure was driven by approach rather than model quality (reported via Virtualization Review).

Read those two findings together and the implication is direct. Most AI projects die in the gap between a pilot that demos well and a system that survives production. Preventing exactly that death is what real AI consulting services are for, and it is the test you should hold every prospective consultant against.

Slideware is advice dressed up to look like delivery, sold at delivery prices, with none of the accountability that delivery carries.

I learned this lens the hard way on the selling side, which I wrote about in selling AI to people who have been burned by AI. The short version: buyers who have already been burned have a very sensitive instrument for detecting slideware, and the honest move is to sell to that instrument rather than around it.

How to choose an AI consultant, and the red flags that should end the call

The selection question reduces to one habit: ask for evidence of shipped, evaluated work, and watch how the answer is structured. A consultant who has shipped will reach for specifics without prompting. A consultant who has not will reach for adjectives. You are not testing their knowledge in that moment; you are testing whether their knowledge has ever met a real user.

I will give you a NDA-safe story that plays out roughly the same way every few months. A founder comes to us holding a sixty-page strategy deliverable from a name-brand firm, paid for in the low six figures, and asks us to "just build the thing in the deck." We read it, and the deck assumes a clean, labeled dataset the company does not have, an integration with a system that was deprecated last year, and an accuracy bar nobody validated against real inputs. The strategy was not wrong on its own terms. It was simply never going to touch production, because nobody who wrote it had ever shipped against that company's actual data.

The red flags cluster, and once you know them they are loud. No live demo of prior work. No evals, or a blank look when you ask how they measure whether the model is right. No named senior engineer who would actually be on your account, just a rotating cast of "resources." Pricing purely by the hour, which quietly aligns their incentive with slowness rather than outcome. And the loudest of all, a refusal to commit to a measurable success criterion before the engagement starts.

Here is the table I would paste into a vendor evaluation. Read each row as a question to ask out loud, then listen for which side of the column the answer lands on.

What to evaluate	Green flag	Red flag
Proof of work	Shows a live system they shipped and the evals behind it	Shows a deck, a logo wall, and "case studies" with no numbers
Who does the work	The senior engineer on your account is in the room and named	Pre-sales engineer presents; juniors deliver behind AI tooling
Success definition	Commits to a measurable bar in writing before starting	Talks in transformation, velocity, and roadmaps; no number
Evaluation	Builds evals in from day one and shares them with you	Treats testing as a QA step at the end, or skips it
Pricing	Priced to an outcome or a fixed scope you can verify	Open-ended hourly with no ceiling and no defined deliverable
Knowledge transfer	Leaves you a reference architecture your team can maintain	Builds complexity only they can explain; lock-in by obscurity

The knowledge-transfer row is the one buyers underweight most. A consultant who hoards understanding to protect future billing is creating dependency, not value, and the burned buyers I talk to can almost always name the vendor who did this to them last.

Engagement models and what they actually cost

Behind every pricing model is a single question: who absorbs scope risk, you or the consultant. Once you see it that way, the menu stops being confusing. The cost ranges below are illustrative of the US market as I see it in early 2026, not a quote, and they move a lot with seniority and domain.

A fixed-bid pilot, scoped to a defined deliverable with a success bar, typically lands somewhere in the range of forty to one hundred and fifty thousand dollars depending on complexity. The consultant carries the scope risk here, which is why a firm willing to work this way is signaling confidence in its own delivery. Time-and-materials, by contrast, puts the scope risk entirely on you, and it tends to run from roughly one hundred and seventy-five to three hundred-plus dollars per senior engineer hour. T&M is appropriate when the problem is genuinely exploratory; it is a trap when it is used to avoid committing to an outcome.

A fractional or advisory retainer, where a senior practitioner gives you a defined slice of their time each month, commonly sits in the five to twenty-five thousand dollar per month band. That model is right when you have a capable team that needs judgment, not hands. I covered the broader build-relationship question in staff augmentation versus consulting, because the words get used interchangeably and they should not be.

The number that matters more than any of these is total cost of being wrong. A cheaper engagement that produces slideware you cannot ship is infinitely more expensive than a higher-quoted one that produces a system in production, because the cheap one costs you the quote plus the months you lose plus the credibility you spend internally defending the decision. Price the outcome, not the invoice.

A cheaper engagement that produces slideware you cannot ship is more expensive than a higher-quoted one that produces a working system, every time.

Consulting versus build versus hire

Consulting is one of three ways to get AI capability, and it is not always the right one. The honest framing, from someone who sells the consulting option, is that you should reach for it only in specific conditions. The other two options are building in-house and hiring permanent engineers, and each beats consulting in its own zone.

Hire permanent engineers when AI is core to your product and will be for years, because that capability should live inside your walls rather than rent. I laid out how to know you have reached that point in when to hire an AI engineer, and what good actually looks like in the pillar on hiring AI engineers. Build in-house when you have the senior judgment already on staff and just need to allocate it; the constraint there is rarely talent and usually focus.

Reach for consulting when you need senior judgment faster than you can hire it, when the work is bounded enough to scope, or when you need someone who has shipped this specific pattern before and can compress your learning curve. The trade-off between owning and renting the capability is the whole subject of in-house versus outsourced AI, and the decision usually comes down to how permanent the need is, not how urgent it feels today.

Here is the second NDA-safe story. A mid-market company hired us for a fixed-bid pilot, we shipped it, and in the close-out we told them the next phase did not warrant a consultant at all; they had the in-house talent to extend it themselves with a light retainer for judgment. We left money on the table saying so. We also got the next two referrals from that founder, which is the only sales math that has ever actually worked for me over a multi-year horizon.

The deliverables to demand before you sign

A contract for AI consulting services should name artifacts, not activities. "We will advise on your AI strategy" is an activity and it commits the consultant to nothing. "You will receive the following, by these dates, meeting these criteria" is a deliverable, and it is the only thing you can hold anyone to.

At minimum, demand a prioritized use-case backlog with value, feasibility, and risk scored per item, so you can defend the sequencing to your board. Demand a written success criterion for any pilot, agreed before work starts. Demand an eval suite delivered with the system, so you can see performance after launch and not just at the demo. And demand a reference architecture your own team can read, maintain, and extend without the consultant in the room.

That last one is the anti-lock-in clause, and it is the deliverable that separates a partner from a dependency. The judgment-over-throughput principle behind all of this, why a smaller amount of accountable, well-evaluated work beats a larger volume of unaccountable output, is something I argued at length in the judgment economy.

If you are weighing an AI initiative right now and want a readiness assessment that ends in a buildable plan rather than a deck, that is precisely what we do at Devlyn's AI strategy and readiness service. It is also fine if, after reading this, you conclude you should hire instead of engage. That is a win for you either way, and a consultant who cannot say that out loud is one of the red flags in the table above.

Frequently asked questions

What do AI consulting services actually include?

At their best, four things in sequence: a readiness and prioritization pass that tells you where AI should go first, a scoped pilot with a measurable success bar agreed in writing, senior-engineer delivery that ships and integrates the system, and an evaluation suite that proves it keeps working in production. A service that gives you only the first step has sold you a report, not a capability. Demand artifacts, not activities.

How much do AI consulting services cost?

It depends on the engagement model and seniority, and the figures here are illustrative of the early-2026 US market rather than a quote. A fixed-bid pilot commonly runs from roughly forty to one hundred and fifty thousand dollars, time-and-materials from about one hundred and seventy-five to three hundred-plus dollars per senior hour, and a fractional or advisory retainer from five to twenty-five thousand dollars a month. The figure that matters most is the total cost of being wrong, which a cheap slideware engagement maximizes.

How do I choose an AI consultant without getting burned?

Ask to see a system they shipped, the evals behind it, and the failure modes they caught before a customer did. The ones who have shipped answer with specifics; the ones who have not answer with adjectives. The loudest red flags are no live demo, no evals, no named senior engineer on your account, hourly-only pricing, and a refusal to commit to a measurable success criterion before starting.

Do I even need an AI consultant?

Not always, and a good one will tell you so. Hire permanent engineers when AI is core to your product for the long term, build in-house when you already have the senior judgment on staff, and reach for consulting when you need that judgment faster than you can hire it or the work is bounded enough to scope cleanly. If your problem is undefined or your data is not ready, no consultant can rescue the engagement, and the honest ones say that before taking your money.

If you want the team-building side of this in depth, Building an AI-Native Team is the companion to this article, and the hiring AI engineers guide covers what good looks like when you decide to bring the capability in-house. When you want delivery rather than a deck, we build exactly this at Devlyn.

Staff Augmentation: When It Beats Hiring (and When Not)

Alpesh Nakrani — Sun, 29 Mar 2026 18:30:00 GMT

Staff augmentation embeds outside engineers in your team while you keep the roadmap and own the outcome. Here is what it is, the models, the real cost, and when it fits.

Staff augmentation is a staffing model where you bring outside engineers into your own team, under your roadmap and your management, instead of hiring them as employees or handing a whole project to an outside shop. It fits best when you know exactly what needs to get built, you are confident in your own direction, and the only thing missing is hands with the right skills, available faster than a full hiring cycle can deliver them.

I have sat on both sides of this arrangement. I have been the engineering leader who needed three senior people next month and could not wait ninety days to hire them. I have also been the partner whose team got embedded into someone else's codebase, standups, and Slack.

So I am going to give you the operator's version of this, not the staffing-vendor version. The vendor version sells you a bench. The operator version tells you when renting one is the right call and when it quietly becomes the most expensive mistake on your roadmap.

Key takeaway: Staff augmentation is right when the direction is clear and the constraint is capacity, not when you are trying to outsource the thinking.
You keep the outcome. Augmented engineers work inside your team and your roadmap; if you want someone else to own delivery end-to-end, that is a project shop or a consulting engagement, not staff aug.
The AI-era version is fewer, more senior people. When generation is cheap, you augment with engineers who can evaluate and own, not with junior throughput you have to supervise.
The failure mode is treating people as rented hands. Skip integration and knowledge transfer and you build a dependency that walks out the door when the contract ends.
Cost is a blended rate, not a salary. It looks more expensive per hour and is often cheaper per outcome, because you skip the hiring cycle, the benefits load, and the cost of a bad permanent hire.

What staff augmentation is, and the three models

Strip away the marketing and staff augmentation is simple. You have a team, and you are missing a skill or some capacity, so you bring in outside engineers who slot into that team, take direction from your leads, attend your standups, and work in your repos. They are not your employees, but for the duration of the engagement they function like members of your team, and you own the roadmap, the priorities, and the definition of done. They supply the execution.

This is a large and growing market, which tells you something about how many teams have landed on the same answer. One industry estimate puts the global IT staff augmentation and managed services market at roughly 318 billion dollars in 2026, up from about 292 billion the year before. You do not need the exact figure to take the point. When a model scales like that, it is usually because it solves a real constraint, not because the marketing is good.

The thing that confuses buyers is that staff augmentation sits next to two other models that look similar from a distance and behave completely differently up close. The distinction that matters is who owns the outcome. Get that wrong and you will buy the wrong thing and blame the people who delivered exactly what you contracted for.

Model	Best for	Watch out for
Staff augmentation	Clear roadmap, known gap in skills or capacity, you want to keep control and direction	You still have to manage them; if your direction is weak, you get fast execution of the wrong thing
Project outsourcing (time and materials or fixed bid)	A scoped piece of work you would rather hand off whole, with a defined deliverable	You give up day-to-day control; scope changes get expensive and slow
Managed services / dedicated team	An ongoing function (support, maintenance, a standing capability) you want run for you	Drift from your priorities over time; the team optimizes for the contract, not your roadmap

If you want to go deeper on the line between renting people and buying an outcome, I wrote a whole piece on staff augmentation versus consulting, because that single question, who owns the result, decides more failed engagements than any rate negotiation ever will. The short version: staff augmentation keeps the outcome on your side of the table. Everything else moves it across.

If you are early in this decision and have not yet settled whether to build the capability internally at all, the prior question is in-house versus outsourced. Staff augmentation is a hybrid answer to that question. It lets you keep ownership in-house while sourcing the hands outside.

When staff augmentation beats hiring or a project shop

Here is the decision logic I actually use. Staff augmentation wins when three conditions hold at once: you know what needs to be built, you trust your own direction enough to steer the work day to day, and the timeline will not survive a full hiring cycle.

That third condition is the one people underestimate. Industry reporting puts the average time to hire a senior software engineer in 2026 around forty-seven days from requisition to accepted offer, and for senior AI engineers it stretches further, because the qualified pool is thin and the demand is brutal. If you have a launch in eight weeks and a gap on the team today, hiring is not a real option, it is a wish. Staff augmentation collapses that timeline because the engineers already exist and are already vetted.

Staff augmentation beats a project shop when you do not actually want to give up control. A project shop is the right call when the work is genuinely separable, when you can write a clean spec, hand it over, and check back at milestones. Most interesting work is not like that; it lives inside your product, touches your data model, and changes weekly as you learn. In that world, handing it off whole means handing off the context, and context loss across an org boundary is where projects quietly die, so if the work needs to stay coupled to your team's daily decisions, augment instead of outsource.

And it beats permanent hiring when the need is real but bounded, or when you are not yet sure the role is permanent. You do not want to hire a full-time specialist for a six-month surge and then face the much harder problem of what to do with them in month seven. If you want the full picture of what a permanent hire actually costs before you commit, I broke down the real cost of an AI engineer separately, because the salary is the smallest part of the number.

If you are weighing all of this against building a team for the long haul, the broader frame lives in my guide to hiring AI engineers, which treats staff augmentation as one lane in a larger talent strategy rather than a standalone tactic. The honest answer is that most teams need more than one lane. The mistake is using the same lane for every problem. If you would rather skip the framework and just talk to senior engineers who can embed this quarter, that is exactly the work we do at Devlyn.

The AI-era version: augment with senior engineers, not bodies

The old version of staff augmentation was about throughput. You needed five sets of hands to grind through a backlog, so you rented five. The model was built for a world where production was the bottleneck and bodies were the answer. That world is fading, and the staffing model that assumes it is fading with it.

When a capable model can produce a first draft, a working prototype, or a test suite in seconds, the constraint stops being how fast you can produce and starts being whether the output is right. I have written about what a team is for after the machine does the work, and the same logic applies to who you augment with. You do not need more hands to fill the canvas. You need people who can look at what the machine produced and know, immediately, whether it is correct, safe, and the right thing to ship.

The old version of staff augmentation rented throughput. The version that works now rents judgment, because judgment is the part the machine cannot supply.

This flips the augmentation math. The instinct under the old model was to rent cheaper, more junior people and supervise them closely, because volume was the goal. Under the new model that is exactly backwards.

A junior augmented engineer producing AI-assisted code you have to review line by line adds review load to a team that is already the bottleneck. A senior augmented engineer who can evaluate, decide, and own a slice of the product end to end removes load. Fewer, more senior people is not a premium option here; it is the cheaper one once you count the review cost.

The full framework for hiring and structuring teams around judgment instead of throughput is in Building an AI-Native Team, and it changes what you should even ask for when you call a staffing partner. Do not ask for three engineers. Ask for one or two who can own the outcome you are missing, and be willing to pay the senior rate, because the senior rate is the one that actually lowers your total cost.

The risks, and how to do it right

The dominant failure mode in staff augmentation has a name on the operator side: rented hands. It happens when you treat augmented engineers as interchangeable units of capacity rather than people you are integrating into a team. You skip the onboarding, you do not give them context, you wall them off from the real decisions, and then you are surprised when the work is technically fine and strategically wrong.

I watched a client do this with a backend team they brought on for a payments build. Smart engineers, strong resumes. But the client kept them out of the product conversations to protect roadmap confidentiality, fed them tickets through a single PM, and never let them talk to the people who understood why the payment flows were shaped the way they were. The team shipped clean code against the tickets, but the tickets were subtly wrong, because the context that would have caught the error never reached the people writing it. That cost a six-week rework that good integration would have prevented in week one.

The second risk is knowledge leaving when the contract ends. If your augmented engineers build something critical and no one on your permanent team understands it, you have not solved a capacity problem. You have rented a dependency. The fix is to require knowledge transfer as a deliverable, not a courtesy, and to pair augmented engineers with at least one permanent owner from day one so the understanding stays after the people leave.

Doing it right is not complicated, it is just disciplined. Onboard augmented engineers like employees for the duration, give them the context rather than just the tickets, and let them into the decisions that shape their work. Pair them with a permanent owner, and hold them to outcomes rather than hours, the same way you should hold your own team, because the moment you start measuring rented people by activity instead of result, you have already lost the plot.

The deeper version of choosing the right partner sits in my piece on how to choose an AI development company, and most of it applies here too.

What staff augmentation actually costs

The sticker shock is real and mostly misleading. A senior augmented engineer bills at a blended hourly or monthly rate that looks higher than the equivalent salary divided into hours. People see that number, compare it to a base salary, and conclude augmentation is expensive. That comparison is wrong, because base salary is not what an employee costs.

The loaded cost of a permanent hire includes benefits, payroll taxes, equipment, software, recruiting fees, the manager time spent hiring, and the weeks of reduced output while they ramp. Layer on the cost of a bad hire, which is brutal in senior AI roles where a wrong fit can quietly burn a quarter, and the real per-outcome cost of an employee is far above the salary line. The blended rate on an augmented engineer already contains all of that overhead, and it disappears the day the engagement ends instead of becoming a severance conversation.

Here is the rough mental model I use, illustrative numbers only, to keep the comparison honest.

// Illustrative only, not a quote, ranges vary widely by market permanent_senior_salary = $180,000 / yr base loaded_multiplier = 1.3 to 1.4 // benefits, taxes, equipment, overhead true_annual_cost = ~$235,000 to $250,000 + ramp time + hiring risk augmented_senior_blended = $130 to $200 / hr six_month_engagement = ~$135,000 to $200,000 all-in, no ramp tail, no severance // Augmentation looks pricier per hour // and is frequently cheaper per outcome on a bounded need.

Staff augmentation gets more expensive than hiring in exactly one situation: a genuinely permanent, full-time need that you keep renting year after year. If you are paying a blended rate for three years for a role that is clearly core and clearly permanent, you are leaving money on the table, and you should convert that to a hire. Augmentation is a tool for speed, bounded needs, and uncertainty, not a permanent substitute for a team you have decided you need. For the full breakdown of where each model lands, the AI engineer cost piece runs the numbers across in-house, augmentation, and agency.

How to choose a staff augmentation provider

Most providers of staff augmentation services are body shops dressed as partners. The questions that expose the difference are not about rates. They are about how the provider thinks about your outcome versus their utilization.

Ask who specifically will work on your account, and insist on talking to those people before you sign, not after. A body shop sells you the senior engineer in the pitch and staffs you with whoever is on the bench. Ask how they handle it when one of their engineers is the wrong fit, because the honest answer reveals whether they optimize for your result or their billed hours. Ask what knowledge transfer looks like as a deliverable, and watch whether they treat it as obvious or as an upsell.

I will give you the tell I trust most: ask the provider to talk you out of the engagement. Describe your situation and ask, directly, when staff augmentation would be the wrong choice for you. A real partner will tell you, because they have a long view and want the next three engagements, not just this one. A body shop will tell you augmentation is perfect for every situation, which is the same as telling you nothing. The provider who is willing to lose a deal to be honest is the one worth hiring.

One more, specific to AI work: ask how they evaluate the quality of AI-assisted output, because in 2026 your augmented engineers are using the same models you are, and the differentiator is no longer who can produce code but who can tell when the produced code is wrong. If a provider cannot describe how their people evaluate model output, they are selling you throughput in a world that has stopped paying for it. If you want a sense of how we think about that on the strategy side before any build, our AI strategy and readiness work is where that conversation usually starts.

Frequently asked questions

What is staff augmentation in simple terms?

It is a staffing model where you bring outside engineers into your own team to work under your direction and roadmap, rather than hiring them as employees or handing a whole project to an outside firm. You keep ownership of the outcome and the priorities; they supply skills or capacity you are missing, usually much faster than a full hiring cycle can deliver.

What is the difference between staff augmentation and outsourcing?

The difference is who owns the result. In staff augmentation you stay in control and direct the work day to day; the engineers function like members of your team. In outsourcing you hand a scoped piece of work to a provider who owns delivery against a spec. Staff augmentation keeps the outcome on your side; outsourcing moves it across the org boundary, along with the day-to-day control.

When should I use staff augmentation instead of hiring?

Use it when the direction is clear, the need is real but bounded or uncertain, and the timeline will not survive a hiring cycle that runs around forty-seven days for a senior engineer and longer for senior AI talent. It is the right tool for speed and flexibility. It becomes the wrong tool when the role is clearly permanent and core, in which case you should convert it to a hire and stop paying a blended rate indefinitely.

Is staff augmentation more expensive than hiring?

Per hour it usually looks more expensive than a salary, but per outcome it is often cheaper on a bounded need, because the blended rate already contains benefits, taxes, equipment, recruiting, ramp time, and the risk of a bad permanent hire, and all of that disappears when the engagement ends. It only becomes the costlier choice when you use it as a permanent substitute for a role you have already decided is core.

If you have decided augmentation is the right move and you want senior engineers who can embed, own an outcome, and evaluate AI-assisted work rather than just produce it, that is the work we do at Devlyn. And if you want the full hiring strategy first, Building an AI-Native Team lays out how to staff for judgment instead of throughput.

What Is a Fractional CTO? A 2026 Operator's Guide

Alpesh Nakrani — Sat, 28 Mar 2026 18:30:00 GMT

A fractional CTO is senior technical leadership on a part-time retainer. Here is what they do, when a startup or SME needs one, and what it costs.

A fractional CTO is a senior technology leader who runs your engineering strategy on a part-time, ongoing basis, usually for a fixed monthly retainer rather than a full-time salary and equity package. You hire a fractional CTO when your technical decisions have started to outweigh your in-house ability to make them, but your stage or budget does not yet justify a full-time chief technology officer. That gap, real technical risk on one side and no senior person to own it on the other, is the exact situation a fractional CTO exists to fill.

I have sat in that seat. I came up as an engineer, moved into commercial and revenue leadership, and along the way I have been the fractional senior technical leader for companies that needed judgment more than they needed another pair of hands. I have also been on the other side, hiring and replacing technical leaders, watching founders pay for a title and get nothing they could use. This is the honest version of both seats, not the sales pitch.

If you are weighing this decision right now, the fastest way to pressure-test it is the same readiness assessment a good fractional CTO would run in week one. Devlyn's AI strategy and readiness assessment maps your real technical risk and where senior judgment actually moves the needle, before you commit to any hire.

A fractional CTO is judgment on retainer, not throughput. You are buying the decisions a senior leader makes, what to build, what to buy, who to hire, not lines of code.
The trigger is risk, not size. When a wrong technical call would cost you a quarter or a fundraise, and nobody in-house can confidently make it, you need one. Headcount and revenue are weak signals on their own.
The 2026 mandate is AI strategy. The modern fractional CTO owns build-vs-buy on AI, the team shape, and whether your AI features will survive contact with production. This is the part most guides miss.
Hire for judgment, not for the title. The dangerous hire is the impressive resume who advises from a distance and never touches the actual decisions or the code.
Engagement models vary widely. Expect a monthly retainer in the low five figures for real involvement; the value is the full-time decision quality at a fraction of the full-time cost.

What a fractional CTO actually is, and what they do

Strip away the marketing and a fractional CTO is one thing: a senior technical decision-owner who works part-time across one or several companies. The "fractional" part means you get a slice of their week, not their whole calendar. The "CTO" part means they carry the same accountability a full-time CTO would, for the technical direction, the architecture, the team, and the risk.

What they do on day one is rarely write code. A good fractional CTO starts by mapping reality: what you have built, what is fragile, what decisions are pending, and which of those decisions could actually hurt you. They translate between the founders who know the business and the engineers who know the system, because that translation gap is where most early technical disasters live.

From there the work is the work of any CTO, compressed: set the technical strategy, make the build-versus-buy calls, own the architecture decisions that are expensive to reverse, hire and structure and sometimes fire the engineering team, and run technical due diligence when you raise or sell. The difference is dosage, not scope. They do the senior thinking; they do not do the daily production.

A fractional CTO is judgment on retainer. You are buying the decisions a senior leader makes, not the hours a junior one bills.

This is the distinction I keep coming back to, and it is the same one I made in my piece on what teams are for after automation: the scarce thing is no longer the ability to produce an artifact. It is the ability to decide whether the artifact is the right one. A fractional CTO is that decision capacity, rented by the month.

When a startup or SME actually needs one (and when it does not)

The honest answer is that most early companies do not need a fractional CTO, and the ones that do often wait too long to admit it. The signal is not your headcount or your revenue. It is the cost of being wrong.

You need one when a single technical decision could sink a quarter or a fundraise, and nobody in your building can confidently make that call. A non-technical founder about to commit eighteen months to a platform architecture. An SME whose entire operation now depends on software the original developer no longer maintains. A team about to pour budget into an AI feature nobody has evaluated. These are risk events, and risk events are what justify senior leadership.

You do not need one when your technical decisions are still small and reversible. If you are pre-product, validating an idea with a no-code tool or a single competent contractor, a fractional CTO is premature. The decisions are not yet expensive enough to be worth senior judgment, and you would be paying retainer rates to over-engineer a thing you might throw away next month.

The middle ground is the advisor. If you have a capable technical lead who makes good calls most of the time but occasionally needs a more experienced voice, you may want a technical advisor on a few hours a month, not a fractional CTO carrying real accountability. The line between the two is ownership. An advisor opines; a fractional CTO decides and is on the hook for the decision.

The decision, in one table

Here is the version I would sketch on a whiteboard for a founder trying to place themselves. Find the row that matches your situation, then read across.

Your situation	Fractional CTO?	Better alternative
Pre-product, validating an idea, decisions small and reversible	No	A strong contractor or no-code build; revisit later
Non-technical founder facing one or two expensive, hard-to-reverse calls	Yes, project-based	A scoped technical strategy review
Growing team, recurring senior decisions, no full-time CTO budget	Yes, retainer	None better at this stage
Capable tech lead who occasionally needs a sounding board	No	A part-time technical advisor
Betting the company on AI features, no in-house AI judgment	Yes, with AI mandate	An AI readiness assessment first
Series B+, technology is the core product, full-time leadership funded	No	Hire a full-time CTO

The pattern in that table is consistent. Fractional is the right answer in the band where the decisions are real and recurring but the full-time cost is not yet justified. Below that band you are too early; above it you have outgrown the model.

The AI-era mandate: what a fractional CTO owns now

This is the part most guides written before 2025 completely miss. The job has changed. A fractional CTO in 2026 is not primarily there to pick a database or set up your CI pipeline. They are there to own your AI strategy, because that is now where the most expensive and most reversible-looking mistakes hide.

The first piece of the mandate is build-versus-buy on AI. Almost every company I talk to is being pushed to "add AI," and almost none of them have a clear answer to the only question that matters: should we build this, buy it, or wrap a model we did not train? Getting that wrong is how teams burn two quarters building an in-house capability they could have rented, or how they ship a thin wrapper on a problem that actually needed real engineering. A fractional CTO makes that call with a clear head. I have written about the underlying choice in in-house versus outsourced AI development.

The second piece is team shape. The right team for an AI-native product is not the team you would have hired in 2019, and the leverage math has shifted hard toward senior judgment. A fractional CTO decides whether you need a generative AI engineer, an ML engineer, or a strong full-stack developer who can integrate models, and in what order. If you are working through that, my guide to AI team structure and the broader definitive guide to hiring AI engineers lay out the roles in detail.

The third piece is whether your AI will survive production. Demos are cheap; production is where AI features quietly fail on the edge cases nobody evaluated. A fractional CTO insists on evaluation discipline before you ship, the same discipline I argue for in the judgment economy, where confident evaluation, not generation, is the real bottleneck. If your company is making an AI bet and has no one who owns this, that is the clearest case for a fractional CTO with an explicit AI mandate. The readiness assessment I mentioned earlier, Devlyn's AI strategy and readiness work, is exactly the structured first pass that maps this before you spend.

What to look for, and the red flags

The hardest part of hiring a fractional CTO is that the worst candidates often have the best resumes. A title at a recognizable company tells you they were in the room. It does not tell you they made the calls, or that they can make yours.

What you actually want is judgment you can observe. Give a candidate a real, messy decision from your business and watch how they reason about the constraint space before they reach for an answer. The strong ones ask what you are optimizing for, what you cannot tolerate, and what the reversal cost is. The weak ones jump straight to a stack recommendation, which tells you they are pattern-matching, not thinking.

The reddest flag is the advisor who never touches the actual decisions. I once watched an SME pay a well-known fractional CTO a healthy retainer for six months: he joined the strategy calls, nodded at the right moments, and produced a slide deck, but he never once owned a decision or got close enough to the code to catch the architectural choice that later cost them a painful rebuild. He had the title and none of the accountability, and the founder did not learn the difference until the bill for the rebuild arrived.

The second red flag is the over-engineer. A certain kind of senior technologist cannot resist building the system they have always wanted to build, on your money, for a company that needed something ten times simpler. I have seen a seed-stage startup get talked into a microservices architecture and a multi-region deployment for a product with a few hundred users. The right fractional CTO matches the engineering to the stage; the wrong one matches it to their own resume.

Cost and engagement models

The honest range is wide, because "fractional CTO" covers everything from a few advisory hours a month to near-full-time involvement. Published 2025 market data puts hourly rates roughly in the $150 to $500 per hour band, with monthly retainers commonly between $3,000 and $15,000 depending on hours and depth. Treat those as illustrative market figures, not a quote; your number depends entirely on scope.

The three common structures are retainer, project, and equity. A monthly retainer buys you a predictable slice of senior time and is the right default for ongoing involvement. A project engagement is scoped to a specific decision or event, a technical due diligence, an architecture review, a build-versus-buy call, and ends when the decision is made. An equity arrangement, sometimes layered on top of a reduced cash rate, aligns a fractional CTO with the company's long-term outcome and is more common with very early startups that are cash-poor.

The frame that matters is the comparison, not the absolute number. A full-time CTO at a funded company runs $183,000 to $390,000 in base salary alone, with total compensation topping $600,000 once equity is included. A fractional arrangement gets you decision quality in that same senior band for a fraction of the cash, because you are paying for the judgment on the decisions that matter, not for forty hours a week of presence you do not yet need.

You are not buying cheaper leadership. You are buying full-time decision quality at part-time cost, on the decisions that actually carry risk.

One more cost that never appears on the invoice: the cost of the wrong full-time hire. A bad full-time CTO is a year of salary, equity, and momentum lost, plus the damage of the decisions they made while you figured out it was not working. A fractional engagement is far cheaper to start and far cheaper to end, which makes it a lower-risk way to get senior judgment into the room before you commit to a permanent one.

The mistakes I see most

The first mistake is hiring a title instead of a decision-maker. Founders get dazzled by a logo on a resume and forget to test whether the person can actually own the calls in front of them. The fix is to make every interview a real decision, not a credential review.

The second is waiting too long. The most expensive technical mistakes get made in the window between "we probably need senior help" and "we finally hired someone." Founders delay because the retainer feels like a lot of money, while a six-figure architectural wrong turn is silently being committed in the meantime. The retainer is almost always the cheaper number.

The third is no handoff plan. A fractional CTO should be building toward a future where you either have a full-time leader or a self-sufficient senior team, not toward permanent dependence on them. If a fractional CTO has no view on how and when their role ends, they are optimizing for their retainer, not your company. The best ones tell you on day one what success looks like and what it means for their own engagement to wind down.

If your immediate pressure is the AI bet specifically, deciding what to build, who to hire, and whether it will hold up in production, that is the sharpest version of the modern fractional CTO mandate. Devlyn's strategy and readiness assessment is built to run exactly that first pass, and the application engineers who build the features pick up once the strategy is set. The full framework for the team you build around those decisions is in Building an AI-Native Team.

Frequently asked questions

What is a fractional CTO?

A fractional CTO is a senior technology leader who owns your technical strategy and decisions on a part-time, ongoing basis, typically for a monthly retainer instead of a full-time salary and equity package. They carry the same accountability as a full-time chief technology officer, the architecture, the team, the build-versus-buy calls, the risk, but for a slice of their week rather than all of it. You are buying senior judgment, not production hours.

When should a startup hire a fractional CTO?

When a technical decision could cost you a quarter or a fundraise and nobody in-house can confidently make it, and your stage does not yet justify a full-time CTO's salary and equity. The trigger is the cost of being wrong, not your headcount or revenue. If your decisions are still small and reversible, you are too early; if technology is your core product and you are funded for it, you have likely outgrown the model and should hire full-time.

How much does a fractional CTO cost?

It varies widely with scope. Published 2025 market figures put hourly rates roughly in the $150 to $500 range, with monthly retainers commonly between $3,000 and $15,000 depending on hours and depth. The useful comparison is against a full-time CTO, whose total compensation can top $600,000 at a funded company; a fractional arrangement buys decision quality in that same senior band at a fraction of the cash.

What is the difference between a fractional CTO and a technical advisor?

Ownership. A technical advisor opines and gives you a more experienced voice when you ask for one, usually a few hours a month with no accountability for outcomes. A fractional CTO decides and is on the hook for the decision, carrying real responsibility for the architecture, the team, and the technical risk. If you have a capable lead who occasionally needs a sounding board, hire an advisor; if you need someone to own the calls, hire a fractional CTO.

Dedicated Developers vs Freelancers: How to Choose

Alpesh Nakrani — Fri, 27 Mar 2026 18:30:00 GMT

Dedicated developers vs freelancers comes down to continuity versus flexibility. Here is the honest tradeoff, the hidden costs of each, and how to choose.

Dedicated developers vs freelancers is really a choice between continuity and flexibility. A dedicated team gives you accountability, accumulated context, and someone who owns the outcome over months; a freelancer gives you speed, a narrow specialty, and a cost you can switch off the day the work is done. Freelancers fit bounded, well-specified jobs with a clear finish line. A dedicated team fits anything that lives in production, evolves, and has to be owned by someone who will still be there in six months.

I have hired both ways for years, and I have been on both sides of the invoice. I have shipped products with a single sharp freelancer who did exactly what I needed in two weeks and disappeared cleanly. I have also watched companies try to build a living product out of a rotating cast of contractors and end up with software nobody understands and nobody owns. The mistake is almost never picking the wrong individual. It is picking the wrong staffing shape for the work in front of you.

This piece is the honest version, written from the operator seat rather than from a dev shop trying to sell you a retainer or a marketplace trying to sell you hours. I will name the real tradeoff, tell you exactly when a freelancer is the right call, when a dedicated team is, the hidden costs nobody quotes you on either side, and a way to actually decide. If you are staffing AI work specifically, the continuity argument gets sharper, and I will explain why.

The core tradeoff is continuity vs flexibility. A dedicated team owns context and outcomes over time; a freelancer gives you flexible, switchable capacity for bounded work.
Freelancers fit a clear finish line. A well-specified, self-contained job with a defined deliverable is where a good freelancer outperforms a team on both speed and cost.
Dedicated teams fit anything that lives in production. Evolving products, on-call ownership, and accumulated domain context all reward continuity over flexibility.
The hidden costs are on both sides. Context loss, ramp-up, and bus factor tax the freelancer model; bench cost and slower switching tax the dedicated model.
AI work tilts toward dedicated ownership. Models drift, evals decay, and prompts rot; without someone who owns the system over time, quality quietly degrades.

The real tradeoff: continuity and accountability vs flexibility and cost

Strip away the sales pitches and the comparison comes down to a single axis. On one end you buy continuity: the same people, accumulating context about your codebase, your customers, and your edge cases, accountable for the outcome over time. On the other end you buy flexibility: capacity you can summon for a specific job and release the moment it is done, often at a lower headline rate and with deeper niche specialization.

Continuity is not a soft virtue. It is the thing that makes the second month cheaper than the first. A dedicated developer who has lived in your system for a quarter no longer asks where the auth logic lives or why that one integration is held together with a workaround. They carry the map in their head. A freelancer rotating in starts at zero on that map every time, and you pay for the re-learning whether you see it on the invoice or not.

Flexibility is not a weakness either. There are jobs where you genuinely do not want a standing team. A one-time migration, a self-contained feature, a specialist skill you need for three weeks and never again. Paying for continuity you will not use is just waste with a nicer name. The skill is matching the shape of the staffing to the shape of the work, not declaring one model universally superior.

The mistake is almost never picking the wrong individual. It is picking the wrong staffing shape for the work in front of you.

Accountability is where the axis bites hardest. When a freelancer finishes and moves on, the accountability for what they built transfers to you, immediately and completely. When a dedicated team owns a system, the accountability stays with them across releases, incidents, and the slow accretion of decisions that make a product work. If the thing you are building will break in production at 2am and someone needs to care, you are buying accountability, and freelancers are structurally bad at selling it.

Freelancer vs dedicated developer: when a freelancer is genuinely the right call

I want to be honest here because most comparisons written by agencies will not be. There is a large class of work where a good freelancer is the correct answer, and forcing a dedicated team onto it is overspending.

The clearest signal is a defined finish line. If you can write the deliverable down in a sentence and know exactly what "done" looks like, a freelancer can often hit it faster and cheaper than spinning up a team. Build this landing page. Migrate this database. Write this one integration against a stable API. Design this logo. The work is bounded, the spec is clear, and continuity adds nothing because there is no second month.

The second signal is a narrow specialty you need briefly. Sometimes you need a Kubernetes expert for a week, or a specific compliance skill for one audit, or a designer for a single launch. Hiring that full time would be absurd, and a generalist team would have to go learn it. A freelancer who does only that thing, all day, every day, will outperform almost anyone you could keep on staff for it. Around 47% of freelancers, nearly 30 million people in the US, provide skilled knowledge services like programming and IT, per Upwork's Freelance Forward research, so the specialist pool for bounded technical work is genuinely deep.

The third signal is genuine uncertainty about whether the work continues. Early validation, a throwaway prototype, an experiment you are not sure will survive contact with users. Buying continuity before you know the work has a future is premature. A freelancer lets you test cheaply and commit to a team only once the work proves it deserves one. Freelancers fit when the job is bounded, specialized, or unproven. The moment it becomes ongoing, owned, and load-bearing, the math changes.

Dedicated development team vs freelancers: when the team is the right call

A dedicated developer or dedicated development team earns its premium the moment the work stops being a project and starts being a product. The defining trait is that the work does not end; it evolves, accumulates, and has to be lived in.

The first signal is that the system lives in production with real users. Production software is never finished. It has incidents, regressions, performance cliffs, and a backlog that never empties. Someone has to own it over time, and that ownership is exactly what continuity buys. A dedicated team that built the thing can debug it in minutes because they remember why it works the way it does. The same incident handed to a fresh freelancer is an archaeology project.

The second signal is that domain context compounds. If understanding your business, your customers, and your data is most of the difficulty, then every restart from zero is expensive. A dedicated team turns that context into an asset that gets more valuable each month. This is the same logic behind staff augmentation, where you embed durable capacity into your own context rather than renting it transactionally.

The third signal is that you need accountability for outcomes, not just delivery of tasks. When you need someone who will own whether the product actually works, not just whether a ticket closed, you need people who stay. This is the same reason the in-house vs outsourced decision rarely turns on raw cost; it turns on who carries the outcome. A dedicated team, in-house or through a partner, can carry it. A freelancer, by design, hands it back.

The hidden costs nobody quotes you

Both models have costs that never appear on the quote. The freelancer's headline rate looks cheaper and the team's looks expensive, but the real comparison includes the costs that hide below the invoice.

On the freelancer side, the biggest hidden cost is context loss. Every time a freelancer rotates out and a new one rotates in, the institutional knowledge walks out the door and you pay to rebuild it. Ramp-up is the visible part of this; the invisible part is the decisions and tradeoffs that never got written down and now have to be rediscovered. The second hidden cost is bus factor: when one freelancer is the only person who understands a critical piece, their availability becomes a single point of failure you do not control.

Scope ambiguity is the third. Freelancers are often paid against a spec, which means anything outside the spec triggers a renegotiation, and software requirements move constantly. A dedicated team absorbs scope drift as part of owning the outcome; a freelancer arrangement turns every change into a small contract negotiation, and those add up in both money and time.

The freelancer's headline rate looks cheaper and the team's looks expensive, but the real comparison includes the costs that hide below the invoice.

The dedicated model has its own hidden costs, and pretending otherwise would be dishonest. You pay for capacity even in the weeks the workload dips, the bench is real. Switching is slower and more expensive; you cannot release a dedicated team the way you cancel a freelancer's next sprint. And there is a turnover risk that is genuinely costly: SHRM estimates the cost of replacing an employee at 50% to 200% of their annual salary, so when a dedicated team member leaves, you absorb a real hit that a freelancer engagement never exposes you to. The honest framing is that neither model is free of hidden cost; they simply hide their costs in different places.

A comparison you can paste into a planning doc

Here is the tradeoff laid out by dimension, so you can drop it into a planning conversation and argue from the same map. The verdict column is mine, from the operator seat, and it assumes the work is ongoing rather than a one-off.

Dimension	Freelancer	Dedicated team
Continuity	Low; context resets when they rotate out	High; context compounds month over month
Accountability	Ends at delivery; transfers to you	Persists across releases and incidents
Flexibility	High; switch on and off quickly	Lower; slower to scale up or down
Headline cost	Often lower per hour	Higher; you pay for the bench too
True cost on ongoing work	Higher once ramp-up and context loss are counted	Lower per unit of outcome over time
Specialization	Deep and narrow on demand	Broad coverage; specialty must be hired in
Best fit	Bounded, specified, finite jobs	Evolving, production, owned systems
Risk	Bus factor; scope renegotiation	Bench cost; turnover replacement cost

The table is not a verdict by itself; it is a way to see which column the weight of your work falls into. If most of your rows point to the freelancer column, hire a freelancer and do not apologize for it. If they point right, you need a team.

Why AI work specifically needs continuity and ownership

Everything above applies to software in general. AI work tilts the axis harder toward dedicated ownership, and it is worth understanding why before you staff it like a normal build.

AI systems are not static once shipped. Models get deprecated and replaced, prompts that worked last quarter drift as the underlying model updates, retrieval indexes go stale, and the eval suite that proved the system was good slowly stops reflecting reality as your inputs change. None of this is visible from the outside. The product looks like it is working right up until it quietly is not, and catching that requires someone who has been watching the same system long enough to notice the drift.

A freelancer who built an AI feature and left cannot do that. They are not watching. The traces accumulate, the failure modes shift, and there is no one holding the institutional memory of why the system was built the way it was. This is precisely the kind of work where continuity is not a nice-to-have; it is the difference between a feature that stays trustworthy and one that degrades into a liability. I have written about this hiring posture more fully in the pillar on hiring AI engineers, and the framework for building a team around judgment rather than throughput is the subject of Building an AI-Native Team.

The ownership point matters even more for AI than the continuity point. Someone has to own the eval suite, decide when a model swap is safe, and be accountable for what the system says to a real customer. That accountability cannot be handed back at the end of a sprint. If you are weighing the staffing model for an AI build, this is where I would put my thumb on the scale toward dedicated. If you want that ownership without standing up a full in-house function, a dedicated AI application engineer from Devlyn is exactly the shape of capacity I would reach for.

How to actually choose

Skip the pros-and-cons list and ask three questions in order. The answers will point you to the right column more reliably than any feature comparison.

First: does the work have a finish line you can describe in a sentence? If yes, lean freelancer. A bounded deliverable with a clear definition of done is freelancer territory, and continuity adds nothing you will use. If you cannot describe "done" because the work keeps evolving, that is your first signal toward a team.

Second: will someone need to own this in production over time? If the thing lives, breaks, and changes after launch, you are buying accountability, not just delivery, and accountability is what a dedicated team sells and a freelancer structurally cannot. If it ships once and you walk away, that ownership question is moot and the freelancer model is cleaner.

Third: how much of the difficulty is accumulated context about your business, your data, or your customers? If most of the hard part is domain knowledge that compounds, every restart from zero is expensive, and continuity pays for itself quickly. If the work is generic enough that any competent specialist could do it cold, you are not paying for context, so do not pay for continuity. For a deeper read on the cost side of this decision, the breakdown in what AI engineers actually cost and the realities of working with an AI development company are both worth your time before you commit.

If you have read this far and you are staffing an AI product that has to stay trustworthy in production, the continuity and ownership arguments both point the same direction. That is the work my team at Devlyn does; you can hire a dedicated AI application engineer here and get the ownership without building the function from scratch.

Frequently asked questions

What is the difference between a dedicated development team and freelancers?

A dedicated development team is durable capacity that owns your system over time, accumulating context about your code, customers, and edge cases, and staying accountable across releases and incidents. Freelancers are flexible, switchable capacity for bounded work; they deliver a defined output and hand accountability back to you when the engagement ends. The team buys you continuity; the freelancer buys you flexibility.

Are freelancers cheaper than a dedicated team?

On the headline rate, often yes. On the true cost of ongoing work, frequently no. Once you count ramp-up, context loss every time someone rotates out, and the scope renegotiations that come with paying against a spec, the freelancer model gets more expensive on work that keeps evolving. For a bounded, finite job, the freelancer usually is genuinely cheaper.

When should I hire a freelancer instead of a dedicated developer?

Hire a freelancer when the work has a clear finish line, needs a narrow specialty you will not need again, or is unproven enough that committing to a team is premature. Build this feature, run this migration, cover this one specialty for a launch. The moment the work becomes ongoing, owned, and load-bearing in production, the case flips toward a dedicated team.

Does AI development change the freelancer vs dedicated team decision?

Yes, it tilts it toward dedicated ownership. AI systems drift after launch; models get deprecated, prompts rot, indexes go stale, and eval suites decay, none of which is visible from the outside. Catching that degradation requires someone who has watched the same system long enough to notice. That continuity, plus clear ownership of the eval suite and model decisions, is hard to get from a freelancer who built the feature and moved on.

The Toptal Alternative That Fits AI Work

Alpesh Nakrani — Thu, 26 Mar 2026 18:30:00 GMT

Toptal is a strong freelance network. For AI product work that needs an engineer who owns the outcome, a senior, AI-native team is the better Toptal alternative.

If you are looking for a Toptal alternative, the honest answer depends on what you are actually buying. For a scoped, well-defined piece of work where you just need a strong vetted freelancer for a few weeks, Toptal is genuinely good and you may not need an alternative at all. For AI product work, where the hard part is owning how a feature behaves in production and not just writing the code, you want a senior, AI-native team that embeds and owns the outcome, and that is a different model than a freelance marketplace.

I should put my bias on the table before I say another word. I run revenue at Devlyn, an AI-native engineering company, so I am one of the alternatives in this comparison. I am going to be fair to Toptal and fair to the other real options, name them by name, and tell you honestly where each one fits, including where Devlyn is the wrong call. You can trust a comparison written by an interested party only if it is willing to send you elsewhere, so I have tried to write the version I would want to read if I were the buyer.

Key takeaway: Toptal is a strong, reputable freelance network; the question is not whether it is good but whether a marketplace model fits the work you have.
The split is ownership. A freelance network rents you vetted hands; AI product work usually needs someone who owns how the feature behaves in production, which is a different purchase.
There are several real alternatives. Turing for global scale, Gun.io for senior US engineers, Arc for fast remote matching, Lemon.io for startups on a budget, and AI-native teams for embedded ownership.
AI work changes the test. The differentiator is no longer who can produce code but who can tell when AI-assisted output is wrong, which favors senior-only delivery.
Cost per outcome beats rate per hour. The cheapest hourly rate is not the cheapest engagement if the work has to be redone.

What Toptal actually is, and what it does well

Toptal is a curated freelance talent network. It markets itself as connecting clients with the top 3 percent of freelance talent, and it spans a focused set of fields: software development, design, finance experts, and project and product management. It is not an open marketplace like Upwork where anyone can list a profile. It is closer to a vetted bench you tap into when you need a specific skill quickly.

The vetting is the core of the pitch, and it is a real process. Toptal says applicants go through multiple stages, including a language and communication screen, a technical interview, a live skills or coding challenge, and a test project, with only a small fraction accepted. Toptal also says you can typically be matched with talent within about 48 hours, and it offers a no-risk trial so you pay only if the engagement works out. Those are Toptal's own claims, and they are consistent with how the network has operated for years. Toptal is reputable enough that it has placed near the top of mainstream reliability rankings, so this is not a fly-by-night operation.

Here is the fair verdict, from someone who competes with them. When the work is well-scoped and separable, a website build, a design system, a finance model, a fixed feature with a clear spec, Toptal does exactly what it promises. You get a vetted, capable freelancer faster than you could hire one, you run a low-risk trial, and you end the engagement cleanly when the work is done. For that shape of work, reaching for an alternative is often solving a problem you do not have.

When the work is well-scoped and separable, Toptal does exactly what it promises. The question is whether your AI work is that shape.

Where the Toptal model gets stretched for AI product work

The marketplace model has a built-in assumption: that the work can be described well enough to hand to a vetted individual who will execute it. That assumption holds for a lot of software. It holds less well for AI product work, and the reason is structural, not a knock on the quality of anyone's talent.

AI features are not done when the code compiles. They are done when the feature behaves acceptably on the messy, real inputs your users actually send, when it fails safely, when its cost per call is under control, and when someone can tell you why it produced a given answer. That work lives in evaluation, observability, and judgment, not in the initial build. A freelancer matched to you for a defined task is incentivized to ship the defined task. The behavior-in-production problem is precisely the part that is hard to scope in advance, which means it tends to fall outside the contract.

I have watched a version of this play out more than once. A team brings in a strong contractor, ships an AI feature that demos beautifully, the engagement ends on schedule, and three months later the feature is quietly drifting on edge cases nobody owns. The contractor did good work against the brief. The brief just did not, and structurally could not, contain the part that mattered most. This is the same demo-versus-production gap I keep running into when I help teams sell AI to buyers who have been burned, and it is the reason I lead with constraints instead of promises.

So the question to ask yourself is not whether Toptal's engineers are good. They are. The question is whether your AI work is the kind that can be handed off as a bounded task, or the kind that needs someone embedded who owns how it behaves after they stop typing. If it is the second kind, a freelance network is the wrong tool, and that has nothing to do with the talent on it.

The real alternatives, compared fairly

If you have decided to look past Toptal, you are not short on options, and they are genuinely different from one another. The mistake is treating them as interchangeable. Here is how I would describe the serious ones, including my own, with the caveat that I compete with all of them.

Turing is built for global scale. It markets a very large worldwide developer pool spanning many countries and uses AI-powered matching on top of its vetting. If your constraint is volume across time zones and you want a large funnel to draw from, Turing is designed for exactly that.

Gun.io focuses on senior, largely US-based engineers, with peer-led technical vetting and a bias toward experienced developers. If you want the curated, agency-like feel with US-time-zone overlap and seniority as the default, Gun.io fits that profile.

Arc is built for fast remote hiring with part-human vetting, and it tends to surface matches quickly, often within a couple of days. If speed and remote flexibility are your priorities, Arc is positioned there.

Lemon.io is startup-focused, screens hard, and draws heavily from Europe and Latin America, which tends to put its blended rates below Toptal's. If you are a startup watching budget and you want vetted developers without enterprise pricing, Lemon.io is the value-oriented pick.

Devlyn, my company, is an AI-native engineering team rather than a freelance network. We staff senior engineers only, default to AI-native delivery, and embed to own an outcome rather than execute a task list. We sell that as productized engagements rather than an open-ended hourly bench. I will detail the differentiator and its limits in the next section, because it is also where I am most biased.

Provider	Model	Best for
Toptal	Curated freelance network, top-3% vetting, no-risk trial	Well-scoped, separable work across dev, design, finance, and PM
Turing	Very large global talent pool, AI matching plus vetting	Scale and volume across many time zones
Gun.io	Senior, largely US-based engineers, peer-led vetting	Seniority and US-time-zone overlap with an agency feel
Arc	Remote hiring, part-human vetting, fast matching	Speed and remote flexibility on a defined role
Lemon.io	Startup-focused, hard screening, EU and LatAm talent	Startups wanting vetted developers below enterprise rates
Devlyn (my company)	Senior-only, AI-native, embedded, outcome-priced	AI product work that needs owned outcomes, not rented hands

Notice that most of these are still variations on the same underlying purchase: a vetted individual matched to your need. They differ on geography, seniority, speed, and price, and those differences are real. The one that sits in a different category is the embedded, outcome-owning model, and whether that category matters to you depends entirely on the work. If you want the broader frame for how these lanes fit into a talent strategy rather than a one-off pick, I laid that out in my guide to hiring AI engineers, which treats marketplaces as one lane among several.

The senior-only, AI-native angle, and my bias

This is the section where I am the interested party, so read it with that in mind. The genuine differentiator for an AI-native team is not a slogan; it is a claim about judgment, and judgment is what AI work now turns on.

When everyone has the same models, the bottleneck stops being who can generate code and becomes who can tell when the generated code is wrong. A senior engineer using AI gets faster without getting less careful, because the judgment that catches a subtle hallucination or a quietly wrong architecture decision is exactly the thing AI does not supply. A junior engineer using the same AI can produce a lot of output that masks the absence of that judgment, and you pay for the gap later as remediation. That is why senior-only is a real differentiator for AI work specifically, and I have made this case at more length on the difference between a senior and a junior AI engineer.

AI-native by default means the engineers build evaluation and observability into the work from day one, because that is how you make an AI feature trustworthy rather than just impressive. Embedded ownership means the engineer is accountable for how the feature behaves in production, not just for shipping it. Those three together, senior-only, AI-native, embedded, are what the marketplace model is not built to deliver, because the marketplace is built around bounded individual tasks.

When everyone has the same models, the bottleneck is no longer who can generate code. It is who can tell when the generated code is wrong.

Now the honest limits, because a fair comparison has to include them. If your work is a clean, scoped build with a clear spec, an embedded outcome-owning team is overkill, and you should use Toptal or Gun.io and pay less. If you need ten engineers next month across many time zones, Turing's scale beats a small senior team. If you are a pre-revenue startup counting every dollar, Lemon.io's rates will be friendlier than a senior AI-native engagement. The embedded model earns its premium only when the cost of the feature behaving badly in production is high, which is most of the time for AI features but not all of the time. If that is not your situation, I would rather you knew it now.

How to choose, and what it costs

The fact that this whole category exists at scale tells you it solves a real constraint. One industry estimate puts the global IT staff augmentation and managed services market at roughly 318 billion dollars in 2026, up from about 292 billion the year before, which is why so many firms compete for your spend. That makes choosing well more important, not less.

Start with one question that sorts most of this: does the work need someone to own how it behaves after they stop working on it, or can it be handed off as a finished deliverable? If it can be handed off cleanly, use a freelance network and optimize on rate, speed, and time-zone fit. If it needs owned behavior in production, the embedded model is worth the premium and a marketplace will quietly underdeliver on the part you care about.

On cost, be careful with hourly rates, because they hide more than they reveal. The numbers that follow are illustrative ranges drawn from public reporting, not quoted prices, and any provider will give you their real figures. Reported blended rates on networks like Toptal commonly land somewhere around 80 to 200 dollars an hour, with senior and AI specialists at the higher end, and some networks add a deposit or a monthly platform fee on top. Value-oriented options like Lemon.io tend to report lower hourly bands. Outcome-priced or productized engagements quote a fixed scope rather than an hour, which is a different unit entirely.

// Illustrative only, not a quote from any provider // The trap is comparing hourly rate instead of cost per outcome freelance_rate = $120/hr - strong vetted contractor ai_native_engagement = fixed scope, senior, owns production behavior // If the AI feature ships but drifts and needs a rebuild: true_cost = first_build + remediation + the_months_it_was_broken // The cheapest hourly rate is not the cheapest engagement // when the work has to be done twice.

I am not telling you the higher-priced option is always right; that would be self-serving and false. I am telling you to compare the thing that actually costs you money, which is the total cost of getting to a feature that works and keeps working, not the sticker rate on an hour. For a fuller breakdown of where the real money goes, the true cost of an AI engineer runs those numbers across hiring, augmentation, and agency models. And if you are still deciding whether to source this outside at all, staff augmentation is the hybrid lane that lets you keep ownership in-house while renting the hands.

What to ask any provider before you sign

This checklist works no matter which option you choose, including mine, and it is the fastest way to separate a partner from a body shop. Use it on Toptal, on Turing, on Gun.io, on Arc, on Lemon.io, and on Devlyn equally.

Ask who specifically will do the work, and insist on talking to that person before you sign rather than after. Ask how they evaluate the quality of AI-assisted output, because in 2026 everyone is using the same models and the only differentiator left is who can tell when the output is wrong. Ask what happens when the person is the wrong fit, and listen for whether the answer protects your outcome or their billed hours. Ask them to describe when their model is the wrong choice for you, because a provider willing to talk you out of a deal has a long view, and one who says they fit every situation is telling you nothing.

For AI work specifically, ask one more thing: who owns the feature's behavior in production after the engagement ends, and how is that handoff documented. If the answer is vague, you have found the gap that will hurt you in month three. I went deeper on the diagnostic questions in my piece on how to vet AI engineers, and the principle is the same whether you are vetting an individual or a firm: trust the provider who is comfortable being measured on the result.

Frequently asked questions

What is the best Toptal alternative?

There is no single best one, because they solve different problems. Turing is best for global scale, Gun.io for senior US-based engineers, Arc for fast remote matching, and Lemon.io for startups on a budget. For AI product work that needs an engineer who owns how the feature behaves in production rather than just delivering code, a senior, AI-native team like Devlyn is the better fit, though I run that company so weigh my view accordingly.

Is Toptal worth it?

For well-scoped, separable work across software, design, finance, and product management, yes. Toptal is a reputable curated network with genuine vetting, fast matching, and a no-risk trial, and for that shape of work it does what it promises. It gets stretched when the work needs embedded ownership of how something behaves in production, which is common for AI features, because a marketplace is built around bounded individual tasks rather than ongoing accountability.

How is an AI-native team different from a freelance network?

A freelance network matches you with a vetted individual for a defined task and ends cleanly when the task is done. An AI-native team staffs senior engineers, builds evaluation and observability into the work from the start, and stays accountable for the outcome in production. The first is the right purchase when the work can be handed off as a finished deliverable; the second earns its premium when the cost of the feature behaving badly is high.

Are Toptal alternatives cheaper?

Some are, on an hourly basis. Value-oriented networks like Lemon.io report lower blended rates than Toptal, while large-scale and senior-focused options vary. But hourly rate is the wrong comparison for AI work. The number that matters is the total cost of reaching a feature that works and keeps working, and the cheapest rate is not the cheapest engagement when the work has to be redone.

If your work is the kind that needs senior engineers who embed, own the outcome, and can tell when AI-assisted output is wrong rather than just produce it, that is the work we do at Devlyn. And if you want the full hiring strategy before you pick any provider, Building an AI-Native Team lays out how to staff for judgment instead of throughput, and the AI development company piece covers how to evaluate a firm rather than an individual.

Turing Alternative: An Honest 2026 Comparison

Alpesh Nakrani — Wed, 25 Mar 2026 18:30:00 GMT

Turing is a fast, large-pool talent cloud. If you are shipping AI features, the fit problem is depth, not quality. Here are the real alternatives, compared fairly.

If you are searching for a Turing alternative, you have usually already formed an opinion about Turing.com and you want to know what else is real. So here is the direct answer first. The best Turing alternative depends on what you are actually hiring for: Toptal and Gun.io for premium human-vetted generalists, Arc and Lemon.io for faster and cheaper senior freelancers, Andela for dedicated embedded hires at scale, and a senior-only AI-native shop like Devlyn when the work is hands-on AI feature delivery and you want someone to own the outcome rather than fill a seat.

I should disclose my bias up front, because the org policy I hold myself to is no misrepresentation. I am Alpesh Nakrani. I started as an engineer and I now run revenue at Devlyn, which is one of the alternatives on this list. I am going to be fair to Turing and fair to everyone else here, because a comparison that only flatters my own company is worthless to you and embarrassing to me. Where I think Devlyn is the right call, I will say so and tell you why. Where it is not, I will point you elsewhere.

This matters because the hiring market for AI work is full of repackaged claims, and the buyers I talk to have learned to distrust the glossy version. If you want the broader context on what good actually looks like, I wrote the long version in my guide to hiring AI engineers. This piece is narrower. It is about Turing specifically, and what to do instead.

Key takeaway: Turing is a large, AI-driven talent cloud that genuinely vets developers and matches fast. The question is not whether it is good. It is whether breadth-and-speed fits your problem.
The fit gap is depth, not quality. Large-pool, AI-assisted matching optimizes for coverage and speed, which moves the burden of screening for hands-on judgment back onto you.
Turing has visibly leaned into AI training data. A meaningful share of the company is now about scoring and refining model outputs, which is a different business from embedded product delivery.
The real alternatives split by model. Premium human vetting, fast freelance pools, dedicated embedded hires, and senior-only AI-native delivery are four different products, not four versions of the same one.
For shipping AI features, optimize for ownership. The expensive failures I see are not slow hires. They are fast hires who shipped something that demoed well and broke in production.

What Turing actually is in 2026

Let me be accurate about Turing before I compare anything, because being unfair to a competitor is both dishonest and a tell that your own pitch is weak. Turing runs what it calls an Intelligent Talent Cloud. It uses AI to source and vet developers at scale, and its own hiring page claims to select from the top 1% of more than three million engineers across 150-plus countries. Every developer clears automated tests across programming languages, data structures, algorithms, system design, and frameworks, plus a 57-question seniority assessment covering project impact, engineering excellence, communication, and direction.

On the buyer side, Turing markets matching most companies with developers within about four days, with a roughly three-week risk-free trial period. That is a real and useful promise if speed and pool size are your primary constraints. None of this is marketing fiction. It is a legitimately large, legitimately fast, AI-driven hiring platform, and for plenty of staffing needs it works.

There is a second thing about Turing in 2026 that you should factor in, and I mean it as fact, not as a knock. A large and growing part of Turing's public business is AI training and evaluation work: domain experts with advanced degrees reviewing, scoring, and refining the outputs of frontier models, with publicized partnerships across major AI labs and chip makers. That is real, valuable work. It is also a different business from staffing an engineer onto your product team, and it is worth knowing that the company's center of gravity has shifted toward it.

The question with Turing is not whether it is good. It is whether breadth and speed fit a problem that actually needs depth and ownership.

Where Turing falls short for hands-on AI delivery

Here is the honest fit critique, and I want to keep it precise so it does not slide into a hit piece. The thing that makes Turing strong for general staffing is the same thing that creates friction for hands-on AI feature work: it is a large pool matched primarily by AI signals. Breadth and speed are the design goal. Depth of judgment on your specific, messy AI problem is not something an automated match can fully guarantee, which means the screening for that depth lands back on you.

For ordinary software roles, that tradeoff is often fine. You can interview, you can run a trial, you can course-correct. For AI feature delivery it gets more expensive, because the failure mode is quieter. An AI feature that wraps a model API can look complete in a demo and then fall apart on the inputs that make up the real, messy world your users live in. I wrote about why vetting AI engineers is harder than vetting general developers, and the short version is that the judgment you are buying does not show up in a coding test.

The second consideration is attention. When a company's center of gravity moves toward AI training data, the embedded-delivery side is still there, but it is no longer the whole story. That is not a criticism of Turing's strategy. It is just a reason to ask, plainly, whether the model you are buying is built for the outcome you need. A talent cloud is built to fill a seat fast. If what you need is someone to own a product function end to end, a seat-filling model is the wrong shape, no matter how good the individual is.

One illustrative example, details changed to stay NDA-safe. A founder I spoke with had hired two strong contractors through a large talent platform to build an AI support agent. Both passed every technical screen. The agent worked beautifully in the demo and then misrouted around a fifth of real tickets, because nobody owned the question of what the agent should do when it was unsure. The engineers were good. The model of engagement had no room for anyone to own that judgment. That is the gap.

The real Turing alternatives, compared fairly

There are several legitimate Turing alternatives, and they are genuinely different products. I have grouped them by the hiring model they are built around, because that is the choice that actually matters. The rate ranges below come from public reporting and the platforms' own marketing, so treat them as illustrative rather than quotes.

Provider	Model	Best for
Turing	Large AI-vetted talent cloud; ~4-day match; trial period	Fast, broad staffing when pool size and speed lead
Toptal	Premium human vetting; markets top 3%; ~$60-200/hr	High-stakes generalist roles where curation matters most
Gun.io	Human-judgment vetting incl. senior technical interview; ~$100-200/hr	Senior freelancers vetted by people, not just tests
Arc	Vetted senior devs, freelance + full-time; ~$60-120/hr	A middle path between premium curation and open pools
Andela	Dedicated, embedded, long-term hires; 2-4 week loop	Scaling embedded teams over months, not weeks
Lemon.io	Pre-vetted, startup-skewed pool; ~24-48h match; ~$55-95/hr	Startups needing a senior freelancer fast and affordably
Devlyn	Senior-only, AI-native, embedded ownership; outcomes over hours	Hands-on AI feature delivery where someone must own the result

A fair word on each. Toptal is the premium human-vetting standard; it markets a top-three-percent acceptance rate, rejects roughly 97 percent of applicants, and runs a multi-stage screen that includes a live interview and a real test project. It is curated and it is expensive, and for a high-stakes generalist hire that curation is the point.

Gun.io leans on human judgment, including a technical interview run by a senior engineer, which is exactly the kind of screen that surfaces architectural reasoning a test cannot. Arc sits in the middle, vetted senior developers for both freelance and full-time, with a broader pool and rates to match. Lemon.io is fast and affordable with a startup-skewed bench, good when you need a capable senior freelancer in your Slack quickly. Andela is built for dedicated, embedded, long-term hires and is a serious option when you are scaling a team over months.

I deliberately did not link out to every one of these platforms, because I am not going to point you at pages I cannot stand behind, and some of them block automated checks. The facts above come from their public claims and from reputable third-party reviews. Verify the current rates and trial terms yourself before you sign anything.

The senior-only and embedded-ownership angle

This is the part where I am pitching my own company, so read it with that bias in mind. Devlyn is senior-only and AI-native, and we work embedded, owning a product function rather than filling a seat. The reason I think this model fits AI delivery is not that our engineers are smarter than everyone else's. It is that the engagement is shaped around ownership instead of hours.

AI tooling amplifies whatever judgment it is attached to. A senior engineer using AI gets faster without getting less careful. A junior engineer using AI can produce volume that hides the absence of the underlying judgment, and you pay for that gap later, in remediation, when the feature meets production. So we run senior-only, and we say it plainly: no juniors hidden behind AI tooling. That single sentence disqualifies the buyers who wanted a cheap seat, which is fine, because they are not who this model serves.

Embedded ownership means we want to own something measurable and be judged on whether it works. That is uncomfortable for a vendor, because it means absorbing scope risk, and it is uncomfortable for a client, because it means defining what success actually looks like. But it is the only structure I know that closes the gap from the founder story above, where everyone was technically competent and nobody owned the hard judgment call.

A talent cloud is built to fill a seat fast. If you need someone to own a product function, a seat-filling model is the wrong shape, no matter how good the individual is.

One more illustrative story, NDA-safe. A team came to us after a fast staffing engagement had shipped an AI feature that passed its tests and still produced wrong answers on the inputs that mattered most. The fix was not a better model. It was an eval suite that measured the failure modes the team actually cared about, plus someone who owned the threshold for when the system should defer to a human. That is delivery work, not staffing work. If that is your problem, this is the engagement we are built for. If your problem is filling a generalist seat quickly, honestly, one of the staffing platforms above is a better fit, and I would rather you go there than hire us for the wrong job.

How to choose a Turing alternative, and what it costs

The choice is mostly about matching the engagement model to the work, not about finding the single best platform. Use the cheapest, fastest model that actually clears the bar for your task, the same discipline I argue for in what AI engineers actually cost.

If you need broad staffing fast and you have the in-house capacity to screen for depth yourself, a talent cloud like Turing or a fast freelance pool like Lemon.io or Arc is sensible, and you will spend somewhere in the rough range of $55 to $120 an hour. If the role is high-stakes and you want maximum human curation, Toptal or Gun.io are the premium options, closer to $100 to $200 an hour. If you are building an embedded team over many months, Andela is purpose-built for that and the longer interview loop is a feature, not a bug.

If the work is hands-on AI feature delivery and the real risk is a feature that demos well and breaks in production, optimize for ownership over hours. That is the senior-only, embedded model, and it is the more expensive sticker per engagement precisely because you are buying accountability for an outcome instead of a block of time. Whether that math works depends on how costly a quiet production failure would be for you, which only you can size. If you are weighing the broader build-versus-buy question, my piece on staff augmentation and how to evaluate an AI development company both go deeper than I can here.

What to ask any vendor before you sign

The same skepticism that makes you search for a Turing alternative is your best diagnostic tool. Use it. Here are the questions I would ask any vendor on this list, including mine, before signing anything.

Who actually does the work? Senior or junior, and how do you know? Ask to meet the person who would be in the code, not a solutions engineer performing delivery.
Who owns the outcome? If the feature underperforms in production, whose problem is that contractually? A seat-filling model rarely has a good answer here.
How do you handle the messy inputs? Ask specifically what happens when the AI is unsure. If there is no answer, nobody owns the hardest part of the job.
What does the trial actually prove? A risk-free trial is only useful if it tests your real workload, not a clean demo. Define the success criteria in writing before it starts.
What are the real rates and terms today? Public rate ranges drift. Confirm current numbers and trial length directly with the vendor.

If you want the deeper framework behind these questions, I put the full operating model in my book Building an AI-Native Team. The questions above are the field-tested short version.

Frequently asked questions

Is Turing a good platform, or should I avoid it?

Turing is a legitimate, large, AI-driven talent cloud with real vetting and fast matching. It is not a platform to avoid. The honest question is fit, not quality. If you need broad staffing quickly and can screen for depth in-house, it can work well. If you need someone to own a hands-on AI delivery outcome end to end, a seat-filling model is the wrong shape for the job.

What is the best Turing alternative for hiring AI engineers?

It depends on the engagement you need. Toptal and Gun.io are the premium human-vetted options, Arc and Lemon.io are faster and cheaper freelance pools, and Andela is built for dedicated embedded hires. For hands-on AI feature delivery where someone must own the result, a senior-only AI-native shop like Devlyn is the model built for that specific problem. I run revenue at Devlyn, so weigh that accordingly.

Why does Turing focus so much on AI training data now?

A large and growing part of Turing's public business is AI training and evaluation: experts scoring and refining model outputs for major AI labs. That work is real and valuable. It is simply a different business from embedding an engineer on your product team, which is worth knowing when you evaluate whether the platform is built for your specific outcome.

How much do these alternatives cost?

Treat these as illustrative public ranges, not quotes. Fast freelance pools like Lemon.io and Arc tend to run roughly $55 to $120 an hour. Premium human-vetted options like Toptal and Gun.io run closer to $100 to $200 an hour. Embedded, senior-only delivery is priced on outcomes rather than hours, so the comparison is accountability for a result versus a block of time. Always confirm current numbers with the vendor.

If your real problem is shipping an AI feature that holds up in production rather than just filling a seat, that is the work my team does, and you can see how we engage here. If your problem is broad, fast staffing, one of the platforms above will serve you better, and I would rather send you there than take the wrong job.

Offshore AI Development: When It Works, When It Burns

Alpesh Nakrani — Tue, 24 Mar 2026 18:30:00 GMT

I run an offshore AI development shop and I have been the buyer too. Here is the honest version of when it works, what it costs, and where it burns you.

Offshore AI development is building your AI features with a team in a different country, usually for a lower fully loaded cost than a domestic hire. It works when you buy senior judgment and clear ownership, and it burns you when you buy cheap hours and assume the model will cover the gap. That distinction is the whole article. Everything below is the reasoning behind it.

I should tell you where I stand before I say anything else, because it changes how you should read this. I run Devlyn, and we deliver AI engineering globally, which means I sell offshore AI development for a living. I have also sat in the buyer's chair for fourteen years, hiring and managing engineering teams across time zones, and watching some of those decisions go badly. So I am not going to tell you offshore always wins; I am going to tell you when it does, because the cases where it fails are the cases that make my job harder.

The reason offshore deserves a fresh look for AI specifically is that the work has changed shape. When a capable model writes the first draft of the code, the human's job narrows to one thing: judgment. That changes the math on who you should be hiring offshore, and it quietly breaks the old offshore playbook of stacking cheap junior hours against a spec.

Key takeaway: Offshore AI development works when you buy senior judgment and ownership, not cheap hours. The hours model is the one that burns you.
The cost gap is real but the rework tax is realer. A distant junior at a low rate that needs heavy rework can cost more than a senior at triple the rate who ships right the first time.
AI work is evaluation-bound, not generation-bound. Generation is cheap now; the scarce skill is telling a confident wrong answer from a correct one, and that skill does not get cheaper offshore.
Timezone, communication, and IP are where offshore quietly fails. Name them in the contract on day one or pay for them in month three.
Pick a partner the way you would hire a senior engineer. Ask who actually writes the code, how they evaluate AI output, and who owns the IP.

What offshore AI development is, and the cost reality

Offshore AI development means contracting a team outside your own country to build, integrate, and ship AI features. That usually means engineers in South Asia, Eastern Europe, Latin America, or Southeast Asia building against your roadmap, on your repos, under your direction. The pitch has always been the same: comparable engineering at a fraction of the domestic cost. The pitch is mostly true and routinely oversold.

Here is the cost reality, with numbers I have verified rather than invented. A fully loaded US developer, meaning salary plus benefits plus overhead, runs roughly $80 to $150 per hour. Offshore rates in Asia commonly land in the $20 to $40 per hour range, with Eastern Europe and senior Latin American talent in the middle. On paper that is a three to five times gap that funds a real product at scale, and published 2026 rate breakdowns put the spread in exactly that band.

The number that matters is not the hourly rate; it is the cost per shipped, correct feature. A distant junior at $25 per hour who needs forty percent rework, two extra review cycles, and a senior on your side to babysit the output is not cheap. A senior offshore engineer at $60 per hour who ships right the first time and needs almost no rework is. The rate gap is real, and the rework tax is realer, and it is the line item most cost comparisons quietly leave out.

For the full breakdown of what an AI engineer actually costs across models, I wrote that up separately in the AI engineer cost article. The short version for this piece: do not buy the hourly rate. Buy the total cost of a working feature, and make whoever pitches you defend that number.

Where offshore AI development wins, and where it burns you

Offshore wins clearly in a few specific situations. It wins when the AI is a feature you need built correctly and quickly on a capability you do not intend to own forever. It wins when the scope is well defined enough that a senior team can run with it. It wins when you need to scale capacity faster than your domestic hiring pipeline can move, and it wins when the cost structure of the domestic alternative would put the project underwater before it ships.

It burns you in an equally specific set of situations, and I have watched every one of them. It burns you when the AI behavior is your actual moat and you hand it to a team that has no stake in getting the nuance right. It burns you when the scope is genuinely ambiguous and you are an ocean away from the conversations that resolve ambiguity. It burns you when you optimize the contract for the lowest rate and discover that you bought juniors hidden behind a model.

That last failure mode is the one I want to name precisely, because it is the defining risk of offshore AI work in 2026. The market is full of shops that will quote you a senior rate, assign a junior, and let a coding model fill the gap between the two. The output looks plausible; it compiles, and it demos. Then it meets production, and the confident wrong answers start surfacing in front of your customers, and nobody on the delivery side has the calibration to have caught them.

The rate gap is real. The rework tax is realer, and it is the line item most cost comparisons quietly leave out.

I learned this lesson early in a way that stuck. A team I was advising had outsourced a document extraction feature to a low-cost shop, thrilled with the rate, and the demo was clean. Three months in, the feature was misreading a specific class of input perhaps eight percent of the time, and every error became a support ticket and a manual fix.

The savings on the rate had been eaten alive by the cost of the errors and the senior time spent firefighting them. The rate was never the price. The rework was the price.

The senior-only counter to the offshore stereotype

The offshore stereotype is cheap juniors, high volume, heavy oversight, and a quality floor you cross your fingers under. That stereotype was earned in an era when the work was generation. You needed bodies to write the code, line by line, so you bought the cheapest bodies that could follow a spec and you accepted the rework as the cost of the savings.

AI work breaks that model, because generation is no longer the hard part. A capable model writes the first draft of the implementation in seconds. What it cannot do is tell you whether that draft is correct, whether it handles the edge case that will surface at 3am, whether the confident output is actually grounded in the data it claims. That judgment is the entire job now, and offshore ai engineers who lack it do not get cheaper, they get dangerous, because they ship plausible wrong work fast.

So the counter to the stereotype is simple and it is the posture I run at Devlyn: senior engineers only, no juniors hidden behind AI. That is not a slogan about junior engineers being bad. It is a statement about what AI work actually requires. The gap between a plausible wrong answer and a correct one is invisible without deep expertise. Buying people who cannot see that gap does not reduce your risk, it buries it, and you find it in production.

This is the same argument I make about team shape in general, and I worked it out in full in Building an AI-Native Team. The offshore version of it is just sharper, because the temptation to trade seniority for rate is built into the offshore sales motion. Resisting that temptation is the most important decision you make when you go offshore for AI.

Timezone, communication, and who owns the IP

The three things that quietly sink offshore engagements are timezone, communication, and IP. None of them are AI-specific, but AI makes each one sharper, and ignoring them is how a good rate turns into a bad quarter.

Timezone is a feature or a tax depending on how you set it up. A team eleven hours ahead can hand you finished work every morning if you build the engagement around asynchronous handoffs and clear written specs. The same team becomes a tax if your model depends on real-time back-and-forth to resolve ambiguity, because every clarification costs you a day. The fix is to insist on a few hours of daily overlap and to write specs precise enough that the offshore team can move without waiting on you.

Communication overhead is the cost most buyers underestimate. AI features are full of judgment calls about model behavior, acceptable error rates, and where a human should review the output. Those calls do not survive a thin spec and a weekly status call. You need a senior on the offshore side who can hold the product context, ask the right questions, and push back when the spec is wrong; that person is expensive and worth every dollar, and their absence is the single best predictor of an engagement going sideways.

IP and data residency are the ones that turn into legal problems if you leave them vague. Decide on day one who owns the code and the model artifacts, where customer data is allowed to live, and what happens to all of it when the engagement ends. I have watched a deal die in procurement not because the work was wrong but because the inference architecture sent customer data somewhere the legal team would not approve. Put data residency and IP ownership in writing before the first commit, not after the first incident.

How to do offshore AI development without getting burned

If you have decided offshore is right for your situation, here is how I would run it, drawn from being on both sides of the table. None of this is exotic. It is just the discipline that separates the engagements that ship from the ones that limp.

First, decide what is yours to own before you outsource anything. If the AI behavior is your moat, keep the core of it in-house and outsource the surrounding work. If the AI is a feature on a capability you will not own forever, outsource it cleanly and move on. I wrote the full framework for that call in the in-house versus outsourced AI piece, and it is the decision you should make before you talk rates with anyone.

Second, buy seniority and ownership, not hours. Structure the contract around outcomes and a named senior who owns the result, not around a body count at a rate. If you genuinely need to extend your own team rather than hand off a project, that is a different model, and staff augmentation is the structure to look at instead of a fixed-scope build.

Third, demand an evaluation discipline. Ask the partner how they prove an AI feature is correct before it ships. If the answer is "we test it" rather than a real eval suite with failure modes and a deploy gate, you are buying vibes. The whole point of AI work is that you cannot eyeball correctness, so the evaluation harness is not optional, it is the product.

Buying people who cannot see the gap between a plausible wrong answer and a correct one does not reduce your risk. It buries it, and you find it in production.

Fourth, start small and instrument everything. Run one well-scoped feature first, measure the rework rate, the communication friction, and the time to a correct ship. A short paid trial tells you more about a partner than any pitch deck, and it costs far less than discovering the truth at full scale six months in.

Choosing an offshore AI development company

When you evaluate an offshore AI development company, interview it the way you would interview a senior engineer, because that is what you are actually buying. The rate card is the least informative thing on the page. The questions below tell you whether you are buying judgment or buying hours.

Ask who actually writes the code. Get the names and the seniority of the people on your engagement, not the headcount of the firm. Ask to talk to the engineer who will lead your work, and judge whether they can reason about model uncertainty and product trade-offs, not just frameworks.

Ask how they evaluate AI output, how they handle the failure modes specific to your domain, and who owns the IP and the data when you part ways. Ask for a reference where the engagement went long, because anyone can deliver a clean three-week sprint and the question is what happens in month nine. The way a firm answers these tells you more than its case studies.

Factor	Where offshore wins	Where offshore burns you
Cost	3 to 5x lower fully loaded rate funds a real product	Low rate hides a rework tax that eats the savings
Seniority	Senior pods ship correct work the first time	Juniors behind a model ship plausible wrong work fast
Scope	Well-defined features a senior team can run with	Genuinely ambiguous work resolved an ocean away
Timezone	Async handoffs deliver finished work every morning	Real-time dependency turns every question into a lost day
Ownership	Named owner accountable for the outcome	Diffuse headcount with nobody owning the result
IP and data	Residency and ownership settled in writing on day one	Vague terms that surface as a legal problem later

For the broader question of how to vet any AI delivery partner, offshore or not, I covered the full checklist in choosing an AI development company. The offshore version adds the timezone and IP questions on top, but the core test is the same: are you buying judgment, or are you buying hours.

The global IT outsourcing market is large and growing, valued at roughly $639 billion in 2026 by Mordor Intelligence and projected to reach $752 billion by 2031. That scale means there are excellent offshore partners and there are shops that will quietly hand you juniors at a senior rate. The market size does not protect you. The questions above do.

If you want a team that ships AI features with senior engineers and evaluation built in from day one, that is the work we do at Devlyn. We deliver globally, we staff senior, and we will tell you when offshore is the wrong answer for your situation, because the engagements that fail are the ones that make this harder for everyone.

Frequently asked questions

What is offshore AI development?

Offshore AI development is building your AI features with an engineering team in a different country, usually at a lower fully loaded cost than a domestic hire. It covers integrating models, building the product features around them, and shipping them to production under your direction. It works best when you buy senior judgment and clear ownership rather than the cheapest available hours.

How much does offshore AI development cost?

Offshore rates commonly run $20 to $40 per hour in Asia, with Eastern Europe and senior Latin American talent in the middle, against a fully loaded US rate of roughly $80 to $150 per hour. The headline gap is three to five times, but the number that matters is cost per correct shipped feature, because a cheap engineer who needs heavy rework can cost more than a senior who ships right the first time.

Are offshore AI engineers good enough for production AI?

Senior offshore AI engineers are absolutely good enough, and the seniority is the whole point. AI work is evaluation-bound, not generation-bound, so the scarce skill is telling a confident wrong answer from a correct one. A senior who has that calibration ships production-grade work; a junior hidden behind a model ships plausible mistakes quickly, regardless of where they sit.

How do I choose an offshore AI development company?

Interview it like a senior engineer, not a vendor. Ask who actually writes the code and their seniority, how they evaluate AI output before shipping, and who owns the IP and data when the engagement ends. Run one small paid feature first and measure the rework rate before you commit to anything at scale.

If you are still deciding whether to build, buy, or augment for AI in the first place, start with my full guide to hiring AI engineers, which lays out every option before you ever pick a country.

Nearshore vs Offshore: Which Fits AI Development

Alpesh Nakrani — Mon, 23 Mar 2026 18:30:00 GMT

Nearshore vs offshore comes down to timezone and total cost, not the hourly rate. For AI work, the bigger question is who owns the outcome.

The difference between nearshore and offshore is timezone, and almost nothing else that a brochure tells you matters as much as that. Nearshore means a team within an hour or two of your working day, Latin America for a US buyer, and it fits teams that need to iterate together in real time. Offshore means a distant timezone, Asia or Eastern Europe for that same buyer, and it fits scoped work that can move asynchronously while you sleep. I have hired and delivered from both sides, and I want to give you the honest version of the tradeoff rather than the one written by whichever staffing vendor happens to sit in the region they are selling.

I am an engineer who became a CRO, and I now build customer-facing AI at Devlyn, so I have sat in both seats. I have been the buyer wiring money to a team eleven hours ahead, and I have been the partner on the other end of that wire delivering against someone else's roadmap. The nearshore vs offshore decision looks like a procurement question. For AI work specifically, it is really a question about feedback loops, and the geography is just a proxy for how fast those loops can close.

If you are weighing where to source an AI team right now and want a second opinion from someone who has built these teams rather than sold them, the Devlyn team takes on exactly this kind of build. The rest of this piece is the framework I would give you for free.

Key takeaway: Nearshore vs offshore is a timezone decision first. Same-day overlap buys you tight iteration; a distant timezone buys you cheaper async throughput. Pick for the loop, not the logo.
The hourly rate lies. Total cost of engagement includes revision cycles and rework, and a cheaper hourly rate often delivers a more expensive finished feature.
Nearshore wins when the work is ambiguous and iterative. AI features are exactly that, so the real-time loop is worth more here than it is for spec-complete CRUD work.
Offshore wins when the work is scoped, spec-complete, and async-tolerant, or when cost genuinely dominates the decision and you can absorb a slower loop.
Geography is the second question. The first is whether the team is senior enough to own the outcome. A senior team wins from either timezone; a junior team fails from both.

Nearshore and offshore, defined without the sales gloss

Nearshore development means contracting a team in a country close to yours, close enough that your working hours substantially overlap. For a US company, that is almost always Latin America: Mexico, Colombia, Brazil, Argentina. The defining feature is not the language or the culture, although those help. It is that when you have a problem at 10am your time, someone qualified is awake and at a keyboard to solve it before lunch.

Offshore development means a team in a distant timezone, where the overlap with your day is small or nonexistent. For a US buyer that usually means South and Southeast Asia or Eastern Europe. The defining feature is the handoff: you write up what you need at the end of your day, it gets built overnight, and you review it the next morning. When the handoff is clean, this is a genuine superpower. When it is not, you lose a full day to every misunderstanding.

People load these two words with a lot of baggage they do not deserve. Offshore is not a synonym for cheap and bad, and nearshore is not a synonym for expensive and good. Plenty of the strongest engineers I have worked with were offshore, and I have seen nearshore teams that billed a premium for proximity and delivered very little with it. The labels describe where the team sits relative to your clock. Everything else is a separate question that the geography conversation tends to smuggle in unexamined.

The timezone and communication tradeoff is the whole game

Strip away the marketing and the entire nearshore vs offshore decision reduces to one variable: how many hours a day can your team and theirs talk to each other in real time. Latin America gives a US buyer most of a shared workday. Bogota and Lima sit on US Eastern time effectively year round, which means a standup at 9am and an unblock at 3pm both just work (Launch Day Advisors lays out the regional overlap if you want the map). A team in Eastern Europe runs six to eight hours ahead of US Eastern, leaving you a narrow window in your morning. A team in South Asia may leave you almost none.

That overlap is not a nicety. It is the rate limiter on how fast a question turns into an answer. In a high-overlap setup, a confused requirement gets clarified in a Slack thread inside the hour and the work continues. In a low-overlap setup, that same confusion costs you a full cycle: you flag it, they read it the next day, they reply with a question, you read that the day after, and a thirty-second clarification has eaten three calendar days.

The honest counterpoint is that async is a discipline, not a disability. A team that writes excellent tickets, records clear context, and never blocks on a synchronous answer can run beautifully across twelve hours of offset. I have seen offshore teams that were faster than co-located teams precisely because the timezone gap forced them to write everything down. But that discipline is rare and it has to be built deliberately. If you do not have it, the offset will punish you, and the punishment scales with how ambiguous the work is.

Timezone overlap is the rate limiter on how fast a question turns into an answer. For ambiguous work, that speed is the whole ballgame.

The cost difference is real, and smaller than the brochure says

Here is the part the rate cards get right and the conclusion they get wrong. Offshore is genuinely cheaper per hour. Senior nearshore engineers in Latin America tend to run somewhere in the $50 to $90 per hour range, while offshore Asia often comes in at $25 to $45 (those bands are from DistantJob's 2026 rate breakdown, and they move with seniority and specialization). On the hourly line alone, offshore can be half the cost or better. If you stop the analysis there, offshore wins every time, and a lot of buyers stop the analysis there.

The number that actually matters is total cost of engagement, which is the hourly rate multiplied by the hours it takes to ship the thing correctly, plus the cost of everything that went sideways along the way. Revision cycles, rework, misread requirements, the feature that got built to the letter of a ticket that turned out to be wrong. These are real costs and they do not show up on the rate card. A team that costs more per hour but needs half the revisions can land the finished feature for less total money, which is the case nearshore vendors make and, on iterative work, often the case the math supports.

I want to be careful not to overclaim this, because it cuts both ways. On a tightly scoped build where the spec is genuinely complete and the rework risk is low, the hourly savings of offshore flow almost straight to the bottom line and the total cost really does come in lower. The total-cost argument is not a universal win for nearshore. It is a reminder that the rate is an input, not the answer, and that you have to estimate the rework before you can compare honestly.

When nearshore wins

Nearshore wins when the work is ambiguous and the requirements will change as you learn. That describes most early-stage product work, most zero-to-one features, and almost all AI work, which I will come back to. When you cannot fully specify the thing up front because you are discovering it as you build, the value of being able to course-correct in the same afternoon is enormous. Every hour of timezone overlap is an hour you can spend steering instead of waiting.

It also wins when the work is collaborative rather than handed off. If your engineers and theirs need to pair, whiteboard, debug a production incident together, or sit in the same design review, the shared workday stops being a convenience and becomes a requirement. A live incident at 2pm your time is a very different experience when the people who wrote the code are awake versus when they will not see your page for ten hours.

Here is an illustrative case, the kind I have seen play out more than once. A team building an AI intake assistant kept changing what "good" looked like every week as they watched real users hit edge cases the spec never imagined. Their nearshore partner sat two hours behind them and simply moved with it, reshaping the prompt logic and the eval set in the same week the new failure modes appeared. The same project run across a twelve-hour gap would have spent that week writing tickets about last week's problems.

When offshore wins

Offshore wins when the work is scoped and spec-complete, the kind of build where you can hand over a clear definition of done and trust it will come back built to that definition. Data pipelines with known inputs and outputs, a well-defined integration, a backend service against a settled API contract, a migration with clear acceptance criteria. When the requirement is not going to move, the timezone gap stops being a tax and starts being free overnight throughput. You go to sleep, work happens, you wake up to progress.

It also wins when cost genuinely dominates the decision and you can structure the work to tolerate a slower loop. Some budgets are real constraints, not preferences, and if halving the hourly rate is the difference between building the thing and not building it, the slower feedback loop is a tradeoff worth making deliberately. The mistake is choosing offshore for the cost and then handing it ambiguous, fast-changing work that needs the loop you just gave up.

An illustrative example from the other direction. A company needed a batch document-processing system built against a fixed, well-understood schema with clear accuracy thresholds, no ambiguity about what done meant. They handed it to an offshore team, defined the eval criteria up front, and got it back built correctly at roughly half the nearshore quote. The work never needed a real-time loop, so the timezone gap cost them nothing and the rate savings were pure win.

The thing that matters more than geography: senior ownership

Here is the reframe I came to after enough of these engagements went well or badly for reasons that had nothing to do with where the team sat. The biggest predictor of whether an outsourced AI build succeeds is not nearshore versus offshore. It is whether the team is senior enough to own the outcome rather than just execute the ticket. A senior team that owns the result wins from either timezone. A junior team that needs you to specify every decision fails from both, and fails more expensively offshore because the loop you need to compensate is the loop you do not have.

This is the same lesson I keep relearning about AI work in general. When generation is cheap, the scarce skill is judgment, the ability to look at what got built and know whether it is actually right. I have written about how that reshapes the in-house versus outsourced decision, and the same logic applies across geography: you are not buying hours, you are buying the judgment to spend those hours on the right thing. A team that needs the spec handed to them in full detail is a team you have to think for, and thinking for a team eleven timezones away is brutal.

So before you let anyone steer you into a region, ask the harder question. Can this team take a vaguely defined outcome, push back on it, decompose it, and own the result through to something that works in production. If the answer is yes, geography becomes a logistics detail you optimize for convenience. If the answer is no, no timezone will save you, and the cheap hourly rate is the most expensive thing on the table. For the full picture on what senior judgment looks like in practice and how to hire for it, that is the heart of my guide to hiring AI engineers.

A senior team that owns the outcome wins from either timezone. A junior team that needs every decision specified fails from both.

How to choose for AI work specifically

AI work has a property that tilts this decision harder than ordinary software does: it is eval-and-iteration heavy by nature. You do not write an AI feature once and ship it. You build it, run it against real inputs, watch where it fails, adjust the prompts or the retrieval or the routing, and run it again. The loop between "we shipped a change" and "we know whether it helped" is the actual unit of work, and it runs many more times per feature than it does for deterministic software.

That property raises the value of timezone overlap specifically for AI builds. Every tightening of an eval set, every adjustment after a bad generation surfaces in production, every "the model is doing this weird thing on these inputs" benefits from being resolved in hours rather than days. This is why my default for genuinely AI-native, fast-moving work leans nearshore or co-located, while my default for scoped AI infrastructure with settled requirements is comfortable going offshore. The work itself tells you which loop speed you need.

The practical move is to split the work by loop speed rather than picking one region for everything. Put the ambiguous, iterative, customer-facing AI work where the loop is tight, and the scoped, spec-complete infrastructure work where the cost is low. A standing partner who can give you senior ownership in a high-overlap timezone for the hard part, and let you push commodity build work elsewhere, is usually the cleanest structure. That hybrid is also why staff augmentation often beats a pure project handoff for AI work: it keeps the outcome and the judgment on your side of the table while sourcing the hands where they make sense.

And run the real number before you decide, not the rate-card number. The honest comparison is loaded cost per shipped feature, including the rework you are likely to eat at each loop speed, which is the same discipline I walk through on what an AI engineer actually costs. The deeper framework for structuring a team around judgment instead of headcount is the whole argument of Building an AI-Native Team.

A comparison you can paste into a deck

Dimension	Nearshore (e.g. LATAM for US)	Offshore (e.g. Asia, Eastern Europe)
Timezone overlap	Most of the workday; real-time iteration	Small to none; async handoff
Hourly rate	Higher (roughly $50-$90/hr senior)	Lower (roughly $25-$45/hr senior, Asia)
Best-fit work	Ambiguous, iterative, fast-changing, collaborative	Scoped, spec-complete, async-tolerant
Loop speed	Fast; clarifications close in hours	Slow; clarifications cost a full cycle
Total cost on iterative work	Often lower despite higher rate (fewer revisions)	Can balloon if rework is high
Total cost on scoped work	Higher; you pay for an overlap you barely use	Lower; rate savings flow straight through
Fit for AI feature work	Strong; eval-and-iterate loop benefits from overlap	Good for AI infra; weaker for fast-changing AI features
What actually decides it	Whether the team is senior enough to own the outcome. Geography is the second question.

Frequently asked questions

What is the difference between nearshore and offshore software development?

Nearshore means a team in a nearby timezone with substantial overlap with your working day, typically Latin America for a US buyer. Offshore means a team in a distant timezone with little to no overlap, typically Asia or Eastern Europe for that same buyer. The practical difference is how fast a question turns into an answer: nearshore resolves clarifications in hours, offshore in days. Cost, culture, and language are real factors too, but timezone is the one that changes how the work actually feels day to day.

Is nearshore more expensive than offshore?

Per hour, yes, usually by a meaningful margin. Senior nearshore rates in Latin America commonly run higher than offshore Asia rates. But the number that decides the budget is total cost of engagement, the rate multiplied by the hours to ship correctly plus the cost of rework. On ambiguous, iterative work, nearshore's tighter loop often produces fewer revisions and a lower total cost despite the higher rate. On scoped, spec-complete work, offshore's hourly savings usually flow straight to the bottom line.

Which is better for AI development, nearshore or offshore?

It depends on the work, but AI feature development leans nearshore because it is eval-and-iteration heavy and benefits from a fast feedback loop. AI infrastructure with settled requirements, like a data pipeline or a scoped integration, runs fine offshore. The strongest setup for many teams is to split by loop speed: tight-overlap teams on the fast-changing AI work, lower-cost teams on the scoped build work. The deciding factor is always whether the team is senior enough to own the outcome.

Does the nearshore vs offshore choice matter more than the team's seniority?

No. Seniority and ownership matter more than geography. A senior team that can take an ambiguous outcome, push back on it, and own the result through to production wins from either timezone. A junior team that needs every decision specified for it fails from both, and fails more expensively offshore, because the real-time loop you would need to compensate is exactly the loop a distant timezone takes away. Decide on ownership first, then optimize geography for convenience.

If you are sourcing a team for an AI build and want it run by senior engineers who own the outcome rather than wait for tickets, that is the work we do at Devlyn. You can hire an AI application engineer who closes the loop fast, in a timezone that fits how your work actually moves.

Do You Need an AI Engineer? An Honest Decision Rule

Alpesh Nakrani — Sun, 22 Mar 2026 18:30:00 GMT

Do you need an AI engineer? Only when AI work is recurring, core, and failing in ways your team cannot diagnose. Here is the honest rule and the alternatives.

Do you need an AI engineer? For most teams reading this, the honest answer is not yet, and possibly not at all. You need a dedicated AI engineer when AI work has become recurring, core to the product, and is failing in production in ways your current team cannot diagnose. If you cannot say yes to all three, an API, a no-code tool, or a scoped partner will serve you better and cheaper than a six-figure full-time hire you are not yet equipped to evaluate.

I have made this hire more than 80 times at Devlyn, and I make it sitting in two seats at once: I read the model traces and I read the P&L. That combination is why I am suspicious of the standard advice, which treats hiring an AI engineer as something you should do as early and as eagerly as possible. It is not. It is a decision with a real wrong answer in both directions, and the wrong answer is expensive enough that I would rather talk a founder out of the hire than watch them make it for the wrong reason.

This is the decision itself, from both seats. It is part of my broader guide to hiring AI engineers, which covers what the role is and how to vet it. Here I am only answering whether you need one. If you have already decided you do and are asking whether the moment has arrived, that is a different question I cover in when to hire an AI engineer. This piece is the prior step: do you need the role at all, and if so, in what form.

The rule is recurring, core, and failing. You need a dedicated AI engineer when AI is load-bearing in your product and no one currently owns whether the model is correct. Anything short of that, you do not.
Most AI work is an API call. A large fraction of what teams want an AI engineer for is solved by a hosted model behind a thin integration that any competent generalist can ship and maintain.
If you cannot vet the hire, do not hire full-time yet. A specialist you cannot evaluate is a gamble, not a hire. Buy pre-vetted judgment through a partner or a fractional engagement until you can grade the work.
The market is structurally short. AI and machine-learning roles are the highest-demand technology positions, and 71% of technology leaders say skills shortages have already delayed projects. You are competing for a small pool at a high price.
Hiring too early is a six-figure mistake. A wrong senior AI hire costs well beyond salary once you count the search, the ramp, the opportunity cost, and the cleanup. Get the decision right before you get the candidate right.

If you want a second read on your own situation before you open a role, Devlyn's AI strategy and readiness work is built to make exactly this call, including when the honest answer is no.

Do you need an AI engineer? The honest decision rule

The rule fits in one sentence: hire a dedicated AI engineer when AI work is recurring, core to the product, and failing in production in ways your current team cannot diagnose. All three conditions have to hold at once. Drop any one of them and the hire is premature.

Recurring means the AI work shows up week after week, not as a single launch you can ship and forget. A one-time integration is a project. A standing stream of model behavior to tune, evaluate, and defend is a role. If the work has a finish line, you do not need a permanent owner for it.

Core means the AI is load-bearing. If the model is wrong, a customer notices and the product is worse, or revenue moves. A summarization feature buried three menus deep that nobody relies on is not core. The recommendation engine that decides what a shopper sees is. The closer the AI sits to the thing you charge money for, the more the hire earns its cost.

Failing in ways your team cannot diagnose is the condition most teams skip, and it is the one that actually separates a real need from a want. If your strongest generalist can look at a bad output and tell you why it is wrong and what to change, you may not need a specialist yet. If the answer to "why did it do that" is a shrug, and that shrug is now sitting in front of customers, that is the signal. The need is not "we want AI." The need is "AI is load-bearing and no one owns whether it is correct."

Signs you DO need an AI engineer

Here are the signals that the answer is genuinely yes. They are about the shape of the work, not the size of the ambition.

The AI is in the critical path of the product. Customers touch model output directly, and when it is wrong, someone has to apologize or fix it in real time.
No one currently owns correctness. You have a feature shipping model output to users and not one person whose job is to know whether it is right and to be accountable when it is not.
The work is recurring and growing. Every week there is more to tune, more edge cases surfacing, more model behavior to evaluate. It is a stream, not a sprint.
Failures are diagnostic dead ends. The model breaks in production and your team cannot reliably tell you why, which means they cannot reliably fix it either.
You need evals, not vibes. Decisions about whether a model is good enough are being made by gut feel in a meeting, and that gut feel is now expensive when it is wrong.

When most of those are true, you are not hiring out of fashion. You are hiring because an accountability gap has opened in your product and it is starting to cost you. That is the right reason, and it is the only one that survives contact with a board conversation.

Consider a fictional but typical case. A 30-person logistics startup, call it Cartwheel, built a document-extraction feature on a hosted model and shipped it in a weekend. Six months later it was processing thousands of customer invoices a day, the extraction was wrong about 8% of the time on a handful of vendor formats, and a support rep was quietly correcting the failures by hand. Nobody could say which formats failed or why. That is recurring, core, and undiagnosable at once. Cartwheel did not have a model problem. It had an ownership problem, and the fix was a person who could read the failures, not a bigger model.

The need is not "we want to use AI." It is "AI is load-bearing and no one owns whether it is correct."

Signs you do NOT need one (use an API, no-code, or a partner)

This is the section most articles skip, because "you probably do not need to hire us yet" is not a comfortable thing for a vendor to say. I will say it anyway, because saying it is how you earn the right to be believed when the answer is yes.

You do not need a dedicated AI engineer when any of the following is true. A hosted API behind a thin integration solves the task, and a competent generalist on your team can ship and maintain that integration. The work is a one-off: a single launch, a prototype, an internal tool that does not need a standing owner. You are pre-product-market-fit and still validating whether the AI feature matters to anyone. Or you cannot yet evaluate the hire, in which case bringing on a senior specialist is buying something you cannot inspect.

Each of those has a better tool than a full-time hire. For the API case, the work is integration, not model engineering, and your existing engineers are the right people for it. For the one-off, a no-code platform or a short contractor engagement is faster and cheaper. For the pre-PMF case, the answer is to validate with the cheapest thing that works before you commit a salary to it. For the cannot-vet case, you buy pre-vetted judgment from a partner until you have someone in-house who can grade an AI engineer's work, and the deeper version of that build-versus-buy call is its own decision I lay out in in-house versus outsourced AI.

Take a second fictional case. A founder named Priya ran a 12-person legal-tech company and was convinced she needed to hire an AI engineer to build a clause-summarization feature. We talked it through. The feature was a single hosted-model call with a careful prompt and a fallback. Her two backend engineers could ship it in a week and own it indefinitely. Hiring a specialist would have meant a four-month search and a senior salary to babysit one API call. She shipped it herself. Eighteen months later, when the AI surface had grown into something genuinely load-bearing, she hired, and by then she could actually evaluate the candidate. The wait was the right call, not a delay.

If you are weighing your own situation right now and want a second read before you open a role, that is exactly the kind of decision Devlyn's AI strategy and readiness work is built for. It is cheaper to be told you do not need the hire than to make it and find out.

What to try before you hire

Before you commit to a full-time AI engineer, there is a sequence of cheaper experiments that either solves the problem outright or tells you, with evidence, that you genuinely need the role. Run them in order.

First, prototype with an API or a no-code tool. Most AI features can be stood up in days with a hosted model and a thin wrapper, or with a no-code automation platform. This is not a toy step. It tells you whether the feature matters to users at all, and it does so before you have spent a single recruiting dollar. A surprising number of "we need an AI engineer" conversations end here, because the prototype either works well enough or reveals that nobody actually wanted the feature.

Second, point your strongest generalist at it and add evals. Give the work to the best engineer you already have and ask them to build a simple evaluation harness: a held-out set of inputs, the outputs you want, and a count of how often the model gets it right. If your generalist can drive quality up with that harness, you have bought yourself months. If they hit a wall they cannot explain, you have just generated the evidence that you need a specialist, which is a far stronger basis for the hire than a hunch.

Third, run a scoped pilot with a partner. If the work is real but you cannot yet vet a full-time hire, a partner with pre-vetted senior engineers can build the first version with defined success criteria, leave you a reference architecture your team can maintain, and transfer the judgment in the process. You get the capability now and the in-house readiness later. The pilot is not a trial. It is the period in which you both learn whether a permanent role is warranted.

Full-time vs fractional vs outsourced: the right first move

Suppose you have run the experiments and the answer really is yes, you need AI engineering capability. The next question is the form, and the default of "post a full-time req" is often the wrong first move.

Full-time is right when the work is permanent, central, and you have someone who can both manage and evaluate the hire. A standing role needs a standing owner, and a senior specialist will only thrive if there is someone above them who can tell good work from confident-sounding bad work.

Fractional is right when the work is real but not yet a full week of it, or when you need senior judgment on architecture and evals without a full salary. A fractional AI engineer sets the foundation, makes the load-bearing decisions, and hands the day-to-day to your generalists. It is the lowest-regret way to get senior judgment into the room early.

Outsourced is right when you cannot vet the hire, need the capability faster than a four-month search allows, or want the delivery risk carried by someone else until the work stabilizes. Pre-vetted senior engineers placed through a partner like Devlyn's AI application engineering clear a harder bar on live work than most internal interview loops can apply, which is the point: you are buying judgment you could not yet screen for yourself.

The trap, in every case, is hiring slowly and full-time for a role you cannot evaluate. That is the worst of all worlds: the longest time-to-value, the highest cost, and the greatest chance of a bad hire you will not detect until it is in production.

A decision table you can run today

Run your actual situation against this. The middle column is the verdict; the right column is the move.

Your situation	Need a dedicated AI engineer?	Do this
One-off feature, single launch, no standing maintenance	No	Ship it with a hosted API and a generalist; no hire
Pre-PMF, still validating the AI matters	No	Prototype with no-code or an API; validate before you spend
A wrapped API solves the task, your team can maintain it	No	Keep it in-house with existing engineers
Real, recurring work but you cannot yet vet the hire	Not full-time yet	Scoped partner pilot or fractional; build readiness
Senior judgment needed on architecture and evals, not a full week of work	Fractional	Bring in a fractional AI engineer to set the foundation
AI is recurring, core, and failing in ways no one can diagnose	Yes	Hire (or outsource first if you cannot vet), and assign an owner of correctness

If you land in a "No" row, the most useful thing you can do this quarter is not hire. If you land in the "Yes" row, the next question is sequencing, not sourcing.

Sequencing the first AI capability

Once the answer is yes, the order of operations matters more than the speed. Teams that get this wrong hire the person first and discover the foundations later, which means the expensive specialist spends month one doing work the company should have done before the req went out.

Start by writing down the failure you cannot tolerate. Not the feature you want, the failure mode that is unacceptable: the wrong invoice total, the hallucinated policy, the recommendation that loses trust. That definition is the spec for the hire and the spec for the evals, and it forces the specificity that vague AI ambitions never do.

Then build the evaluation harness before, or alongside, the hire, so that the first thing the new engineer inherits is a way to know whether they are winning. After that, decide the form using the table above, and only then open the search or call the partner. The full operating model around this hire, the roles, the cadences, and the evidence loops that keep machine output honest, is the subject of my book, Building an AI-Native Team. The sequencing discipline is dull and it is exactly what most teams skip.

Hire the person first and discover the foundations later, and the expensive specialist spends month one doing work you should have done before the req went out.

The cost of hiring too early

The reason I push so hard on the decision is that the wrong answer is genuinely expensive, and the expense is invisible until it has already happened. A senior AI hire is one of the costliest positions on the market right now. AI and machine-learning engineering is the single highest-demand technology role, with a national midpoint around $170,750 according to Robert Half's salary research, and the search itself runs months because the pool is small and contested.

That tightness is not anecdotal. Robert Half's demand research reports that 71% of technology leaders say skills shortages have already caused project delays, with AI integration the most affected category. So when you start a senior AI search before you need it, you are committing to a long, expensive process to fill a role whose justification is still a hunch.

Now add the failure rate. Widely reported industry surveys put the share of AI projects that fail to reach or survive production at well over half, and some estimates run past 80%, roughly twice the failure rate of non-AI technology work. I treat those figures as illustrative rather than precise, but the direction is not in doubt: most AI projects do not make it, and a wrong early hire ties your fate to a project that has not yet earned that bet. The all-in cost of a wrong senior hire, once you count recruiting, ramp, opportunity cost, and cleanup, runs well past the salary line, and you usually do not detect it until six months in. The full budget picture, in-house and outsourced, is in what an AI engineer costs.

The cheaper path is almost always to validate first and commit second. You do not lose the option to hire by waiting; you lose the option to un-hire by rushing.

Frequently asked questions

Does my company need an AI engineer to use AI?

No. Most uses of AI today are a hosted-model API behind a thin integration, and a competent generalist engineer can build and maintain that. You need a dedicated AI engineer only when AI work becomes recurring, core to the product, and starts failing in ways your existing team cannot diagnose. If you can ship it with an API and keep it running with the people you have, you do not need the specialist yet.

When do you need an AI engineer instead of an API or a no-code tool?

When the model is load-bearing and someone has to own whether it is correct. APIs and no-code tools are the right answer for one-off features, prototypes, and tasks a wrapped model solves cleanly. The moment the AI is in the critical path, failing on real inputs, and no one can explain or fix those failures, you have crossed from integration work into engineering work that warrants a dedicated owner.

Can a fractional or outsourced AI engineer work instead of a full-time hire?

Often, yes, and it is usually the better first move. Fractional engineers are right when you need senior judgment on architecture and evals without a full week of work. Outsourced or pre-vetted partners are right when you cannot yet evaluate a hire yourself or need the capability faster than a multi-month search allows. Both let you get the capability now and build in-house readiness before you commit a permanent salary.

What does hiring an AI engineer too early cost?

More than the salary. AI engineering is the highest-demand technology role, with a midpoint around $170,750 and searches that run months, so you spend heavily on a hunch. Add the high failure rate of AI projects and the all-in cost of a wrong senior hire, recruiting, ramp, opportunity cost, and cleanup, and a premature hire becomes a six-figure mistake you typically do not detect for half a year. Validate first, commit second.

If you are sitting with an AI idea or a stalling AI feature and genuinely cannot tell whether you need to hire, that is the decision Devlyn's AI strategy and readiness work exists to make for you, honestly, including when the answer is no. The fuller hiring picture lives in the hiring AI engineers guide, the definition of the role is in what an AI engineer is, and the team you build around the hire is the subject of Building an AI-Native Team. Decide whether you need the role before you decide who fills it. That order is the whole game.

The AI Skills Gap: What It Is and How to Fix It

Alpesh Nakrani — Sat, 21 Mar 2026 18:30:00 GMT

The AI skills gap is real, but the fix is not more training. Here is what the gap actually is, why it persists, and what leaders should do this quarter.

The AI skills gap is the distance between the AI work companies now expect their teams to do and the number of people who can actually do it well, and it persists because demand outran supply at the same moment the most important skill stopped being a thing you can teach quickly. That second part is the one most coverage misses. Producing AI output got easy. Knowing whether the output is correct did not, and that judgment is the scarce thing. The practical fix is not to wait for a training program to graduate your way out of it. It is to hire senior where it counts, partner for the work you cannot staff, and redesign the work around evaluation instead of generation.

I say this from both seats. I am an engineer who turned operator, and I have spent the last two years building customer-facing AI at Devlyn while also helping leaders hire for it. I have felt the gap as a hiring manager who could not fill a role, and I have felt it as the person who has to ship something correct on Monday regardless. The honest version of this problem is less comforting than the LinkedIn version, and more useful.

The gap has two halves. One half is ordinary demand-vs-supply lag that training narrows over time. The other half is new: the scarce skill is judgment, the ability to look at confident model output and know whether it is right.
Upskilling is slow because calibration is slow. You can teach the tools in weeks. You cannot teach the instinct for a plausible-but-wrong answer in weeks, and that instinct is the part that matters in production.
The realistic fixes are hire senior, partner, and redesign workflows. Each addresses the gap now instead of betting on a graduating class that arrives in a year.
The expensive mistakes are predictable. Juniors hidden behind AI, a headcount race, certificate theatre, and frontier-everything all feel like progress and all bury the real problem.

What the AI skills gap actually is

Most definitions of the AI skills gap stop at "there are more AI jobs than qualified people," which is true but shallow. The gap is really two different shortages wearing the same name. Treating them as one problem is why so many responses miss.

The first shortage is the obvious one. Companies decided, more or less all at once, that every team should be doing AI work, and the labor market did not have enough people who had done it before. ManpowerGroup's 2026 survey found that for the first time, AI skills are the single hardest category for employers to fill globally, ahead of every traditional engineering and IT skill, with 72% of employers reporting difficulty filling roles overall (ManpowerGroup, 2026). That is the demand-vs-supply half. It is real, it is large, and it behaves like every previous tech-skill shortage: painful now, narrowing slowly as people retrain.

The second shortage is the one that does not behave like the others. The work itself changed shape. When a model can produce a first draft, a working prototype, or a passable implementation in seconds, the bottleneck stops being "can someone produce this" and becomes "can someone tell whether this is right." That is a judgment skill, not a production skill, and it is genuinely scarce. Most of the people who look qualified on paper can drive the tools. Far fewer can catch the confident wrong answer before it ships.

So when a leader says "we cannot find AI talent," they are usually describing both shortages at once without separating them. They cannot find people who have done the work, and they cannot find people who can be trusted to judge the work. Those need different fixes, and conflating them leads to spending a training budget on the half that training cannot solve.

Why it persists: demand outran supply, and the new skill is new

The demand-vs-supply half persists for a boring, durable reason: demand for AI skills compounds faster than the pipeline that produces them. LinkedIn's 2026 data shows job postings requiring AI literacy grew more than 70% year over year, with AI engineering the single fastest-growing skill on the platform (CIO Dive, 2026). A talent pool does not grow 70% in a year. It grows at the speed people can plausibly retrain, which is much slower, so the gap widens even as more people enter it.

The judgment half persists for a deeper reason. The skill that now matters most is the one that takes the longest to build, because it is built from being wrong and learning why. A senior engineer can look at a model-generated implementation and feel, before they can fully articulate it, that something is off about the error handling or the way it treats an edge case. That feeling is compressed experience. It is the residue of having shipped the wrong thing before and paid for it.

You cannot shortcut that with a curriculum. A course can teach someone what retrieval-augmented generation is, how to write an eval, or how to structure a prompt. It cannot give them the thousands of small corrections that turn into calibration. This is the same point I make in my piece on org charts after automation: when generation is cheap, the scarce input is confident evaluation, and confident evaluation is the slowest thing to grow.

Producing AI output got easy. Knowing whether the output is correct did not, and that judgment is the scarce thing.

There is a quieter reason the gap stays open, too. The market is full of people who can demo. A polished demo and a production-grade system look identical for about ten minutes, and most hiring processes do not run longer than that on the dimension that matters. So companies hire for fluency, get fluency, and discover six months later that fluency was never the gap.

Why upskilling is slower than the dashboards suggest

Upskilling is the default answer to any skills gap, and for the demand-vs-supply half it genuinely helps. Teach your engineers the AI toolchain, give them real projects, and over a year you will have more capable people than you started with. I am not against it. I am against treating it as the whole answer, because the timeline is longer than the planning slide admits.

The reason is time-to-competence. The tools take weeks. The judgment takes a year or more of doing the work under real stakes, with someone senior in the loop who can say "no, look again, that is wrong and here is why." Without that senior presence, an upskilling program produces people who are confidently mediocre, which is worse than honestly junior, because confidently mediocre output passes review and reaches customers.

I watched a team I was advising run exactly this play. They put their backend engineers through an intensive AI program, declared the gap closed, and shipped an AI feature three months later. It worked in the demo and failed quietly in production, returning answers that were fluent and wrong often enough to erode trust before anyone caught the pattern. The training was fine. What was missing was anyone with the calibration to notice the failure mode early, and no three-month program produces that.

The lesson is not "do not train." It is "do not let the training timeline set your product timeline." Upskilling is a multi-year investment in your bench. It is not a way to staff a launch this quarter, and pretending otherwise is how the gap turns into shipped errors. If you want the honest version of how long real competence takes, the difference between a senior and a junior AI engineer is mostly this exact thing.

The gap, driver by driver

It helps to lay the drivers out plainly, because each one points to a different response. The mistake is applying one response to all of them.

Gap driver	Why it persists	The response that works
Demand for AI skills grew faster than the talent pool	Postings compound; retraining does not keep pace	Hire senior for the core; partner for surge and specialist work
The scarce skill is judgment, not tool fluency	Calibration is built from being wrong over years, not taught in a course	Hire for evaluation ability; put a senior in every review loop
Demos and production look the same in interviews	Hiring processes test fluency, not failure-catching	Test candidates on evaluating output, not producing it
Upskilling timelines are longer than launch timelines	Tools take weeks; judgment takes a year-plus under real stakes	Decouple the product timeline from the training timeline
The work itself changed shape	Generation is cheap; review is the new bottleneck	Redesign workflows around evaluation, not artifact throughput

The fixes that actually work

Three responses move the needle now, and they work together rather than competing. None of them is "wait for the market to catch up."

Hire senior where judgment is load-bearing. The posture I run at Devlyn is senior engineers only, no juniors hidden behind AI. That is not a slight on junior engineers; it is a statement about what the work needs right now. The leverage available to one senior person with good tooling now covers what used to take several people beneath them, and crucially, that senior person can tell whether the machine's output is correct. One engineer you trust to catch the confident wrong answer is worth more than three who can only produce. The trade-offs and real numbers behind this are in what it actually costs to hire an AI engineer.

Partner for the work you cannot staff. Not every role should be a full-time hire, and the gap makes that more true, not less. There is specialist work, surge work, and net-new product work where waiting six months to fill a seat costs more than the seat. Partnering lets you put senior judgment on the problem now and keep your permanent headcount focused on the core. This is the half I sit on at Devlyn, so read it with that in mind, but the logic holds independent of who you partner with: buy the judgment you cannot grow fast enough internally.

Redesign the workflow around evaluation. This is the fix that costs nothing and gets skipped most. If generation is now the cheap part, your process should spend its scarce human attention on review, not production. That means explicit eval gates before anything ships, named owners for quality, and a culture where catching a wrong answer is celebrated rather than treated as friction. A team that has reorganized around evaluation gets more out of the people it already has, which directly narrows the gap without hiring anyone. The full framework for building a team this way is in Building an AI-Native Team.

What NOT to do

The failure modes here are predictable, and they all feel like progress while they are happening. I have made or watched every one of these.

Do not hide juniors behind AI and call the gap closed. It is tempting to staff cheaply and assume the model lifts everyone to senior output. It does not. It lifts everyone to senior-looking output, which is a different and more dangerous thing, because the wrong answers now arrive wrapped in fluent prose that passes a quick read. You have not closed the gap; you have hidden it inside work that looks fine until it is in front of a customer.

Do not run a headcount race. The instinct when you cannot find talent is to widen the funnel and hire more, faster. But if the scarce skill is judgment, more bodies without judgment just means more output you cannot trust and more review load on the few people who can evaluate. Scaling the wrong input makes the bottleneck worse, not better.

Do not mistake certificate theatre for capability. A wall of AI certifications tells you someone completed a course. It tells you nothing about whether they can catch a hallucination in a domain that matters to you. Test for the actual skill, which means showing candidates real output with real flaws and watching whether they find them, an approach I lay out in what actually separates good AI engineers.

Do not default to the frontier model to paper over a skills problem. When a team lacks the judgment to make a smaller, cheaper system work, the easy move is to throw the biggest model at everything and hope capability covers for the missing evaluation discipline. It does not, and it taxes your unit economics for the privilege. A skills gap dressed up as a compute bill is still a skills gap.

A skills gap dressed up as a compute bill is still a skills gap. The frontier model does not give you the judgment you were missing.

How leaders should respond to the AI skills gap this quarter

The decision in front of most leaders is not "train or hire." It is how to allocate three levers across a problem that has two halves. Here is the frame I use when leaders ask me directly.

For the demand-vs-supply half, invest in upskilling as a multi-year bench-building program, and be honest that it pays off in years, not quarters. Put your best senior people in the loop with the people you are training, because that proximity is where calibration actually transfers. Do not let this program's timeline set your product roadmap.

For the judgment half, which is the half that bites this quarter, hire senior for the roles where a wrong answer is expensive, and partner for the work you cannot staff in time. Then do the free thing: redesign your workflow so human attention is spent on evaluation rather than generation, with named owners for quality and explicit gates before anything reaches a customer. The order you build this team in matters more than people expect, which is why I wrote the order you actually build an AI team in.

The companies that come out of this ahead will not be the ones that hired the most or trained the most. They will be the ones who understood that the gap was mostly a judgment problem and built their organization to concentrate judgment where it pays. The rest will keep hiring fluency and wondering why the gap will not close. If you want help putting senior AI judgment on a problem you cannot staff for right now, that is exactly the work we do at Devlyn.

Frequently asked questions

What is the AI skills gap?

It is the distance between the AI work companies expect their teams to do and the number of people who can do it well. It has two halves: an ordinary demand-vs-supply shortage that training narrows over time, and a newer shortage of judgment, the ability to evaluate whether confident model output is actually correct. The second half is the one that does not close quickly, because calibration is built from experience, not taught in a course.

Is the AI talent shortage real or hype?

Real. ManpowerGroup's 2026 survey found AI skills are now the single hardest category for employers to fill globally, ahead of every traditional engineering skill, and LinkedIn data shows AI-related postings growing more than 70% year over year against a talent pool that grows far slower. The hype is not in the existence of the gap; it is in the idea that a training program alone closes it.

Can upskilling close the AI skills gap?

Partly, and slowly. Upskilling reliably teaches the tools in weeks and helps the demand-vs-supply half over a multi-year horizon. It does not quickly produce the judgment to catch a plausible wrong answer, which takes a year or more of real work under senior supervision. Use upskilling to build your bench, not to staff a launch this quarter.

Should I hire or partner to fix an AI skills gap?

Both, applied to different problems. Hire senior for the core roles where a wrong answer is expensive and you can find the person. Partner for specialist work, surge capacity, and net-new product work where waiting six months to fill a seat costs more than the seat. Either way you are buying judgment you cannot grow internally fast enough, which is the constraint that actually matters.

The Cost of a Bad AI Hire (It Is Not the Salary)

Alpesh Nakrani — Fri, 20 Mar 2026 18:30:00 GMT

The cost of a bad AI hire is not the salary you wasted. It is the un-evaluated system they shipped, the roadmap that stalled, and the trust your team lost.

The cost of a bad AI hire is not the salary. The salary is the receipt you can see; the real cost is the un-evaluated AI system that person shipped, the four months of ramp you paid for before it broke, the roadmap that did not move while they were here, and the senior engineers you pulled off real work to clean it up. Add it up and a single bad AI hire on a senior seat routinely runs six figures, most of it invisible until a customer finds it for you.

I have hired and deployed more than 80 senior AI engineers at Devlyn, and I have also paid for the wrong ones. I sit in two seats at once: I read the model traces, and I read the P&L. From that seat, the salary line is the part of a bad hire I worry about least. This piece is the cost-of-failure deep-dive that branches off my pillar guide to hiring AI engineers, and I am going to give you the whole number, including the part of it that is specific to AI and that no generic hiring article will ever quote you.

The salary is the floor, not the cost. Generic benchmarks put a bad hire at 30% to 50% of annual salary; for a senior AI engineer the all-in number is several times that once you count ramp, opportunity, and cleanup.
The AI-specific cost is silent failure. A bad AI hire ships a system with no evals and no observability; it demos clean and fails in production, where the failure surfaces as churn, not a stack trace.
Opportunity cost is the largest line. The roadmap that did not ship while a weak hire ramped and was replaced usually dwarfs the direct replacement cost.
Morale compounds the bill. Seniors pulled into cleanup ship less and trust the AI less; that drag outlasts the person who caused it.
You can compute your own number, and you should. A finance-ready estimate is the cheapest insurance against making the same hire twice.

If you are weighing this cost right now because a hire is not working, or because you are about to make one and want the math first, the fastest way to take the hiring risk off the table is to start with vetted senior engineers on a trial. That is exactly what Devlyn's AI application engineers are for: senior, trial-first, priced as an outcome rather than a gamble.

The visible cost everyone quotes, and why it is the small number

Start with the number you can already find. Generic hiring research puts the cost of a bad hire at roughly 30% of the employee's first-year earnings, a figure commonly attributed to the U.S. Department of Labor, and the widely cited 50% to 200% replacement range from HR sources scales with seniority. A CareerBuilder survey famously pegged the average bad hire at about $14,900, with 74% of employers admitting they had hired the wrong person. Those numbers are real, and they are the floor.

The visible cost is the stuff that hits an invoice. It is the salary you paid for the months the person was on the team. It is the recruiter fee or the sourcing time, the severance if there was any, and the cost of running the search a second time. For a US senior AI engineer at a $180,000 base, even the conservative benchmark math lands you in the $54,000 to $90,000 range before you have priced anything that does not appear on a payroll report.

I call this the small number on purpose. It is the part of the cost that finance already understands and that every cost-per-hire calculator will hand you. It is also, in my experience, less than half of what a bad AI hire actually takes out of the business. The benchmark articles stop here because the rest of the cost is hard to see and harder to attribute, which does not make it any smaller.

The hidden cost: ramp, opportunity, and morale

A senior AI engineer does not ship production-grade work on day one. There is a ramp: learning your stack, your data, your model behavior, your definition of good. You pay full salary through all of it, and with a bad hire you discover at the end of that ramp that the work does not hold up. The ramp was not an investment that paid off; it was a sunk cost you cannot recover.

Opportunity cost is the line that almost nobody computes and that is almost always the largest. While the wrong person occupied the seat, the roadmap did not move; the feature that should have shipped in Q2 slips to Q4. The competitor who shipped it in Q2 takes the customers who would have been yours. That gap does not show up on any invoice, but it shows up in revenue, and it is frequently larger than salary, ramp, and replacement combined.

Then there is morale, which is real money wearing soft clothes. When a hire ships work that breaks, your good engineers stop their own work to fix it and review everything more defensively. They learn, quietly, that the AI in your product cannot be trusted, and that distrust slows every future decision. I have watched one weak hire turn a fast team into a cautious one, and the caution outlasted the person by quarters.

A bad AI hire does not cost you a salary. It costs you a quarter of roadmap, a tax on every senior on the team, and a system you now have to earn back trust in.

The AI-specific cost nobody prices: the un-evaluated system that fails silently

Here is the cost that makes a bad AI hire categorically worse than a bad hire in any other engineering role. In most software, a weak engineer writes code that breaks loudly. It throws an exception, the build goes red, a test fails, and you find out in CI before a customer ever sees it. The failure is visible, and visibility is mercy.

AI does not fail that way. A bad AI hire ships a system that compiles, demos beautifully in the room, and is confidently wrong in production, with no evals behind it because the person did not know to build them or did not bother. There is no observability either, so when the model starts hallucinating on real traffic, nothing turns red. The failure surfaces as a customer who got a wrong answer, lost trust, and left, and you learn about it from churn and support tickets weeks later, long after the cause is cold.

This is not a rare edge case; it is the dominant pattern. RAND's 2024 research found that more than 80% of AI projects fail, roughly twice the failure rate of non-AI IT projects. A large share of that failure is judgment that was never evaluated for: shipping un-instrumented systems, mistaking a clean demo for a production-ready one, and treating the model's confident output as correct output. That is precisely the judgment a good vetting process screens for and a bad hire lacks.

Let me make it concrete with an NDA-safe composite. A team I advised hired a strong-on-paper engineer who shipped a customer-facing assistant that tested perfectly on the handful of prompts they tried in the demo, then confidently gave wrong account information to a slice of real users. There were no evals, no logging of model outputs, and no alerting. The team found out from a spike in support escalations, spent three senior-weeks reconstructing what the system had been telling people, and spent more rebuilding customer trust than they ever spent on the build; the salary was the cheapest part of that quarter.

A cost breakdown you can put in front of finance

Here is the full cost of a bad AI hire, broken into the lines that actually make it up. The figures are illustrative, modeled on a US senior AI engineer at a $180,000 base who is on the team for about five months before being replaced. Your numbers will differ; the structure will not.

Cost line	What it is	Illustrative figure	Who absorbs it
Wasted salary and benefits	Fully loaded comp for months on the team with no durable output	$90,000	Finance
Recruiting and replacement	Sourcing, fees, and running the search a second time	$25,000	Talent / HR
Ramp written off	Onboarding and senior time spent bringing them up that returns nothing	$30,000	Engineering
Cleanup and rework	Senior engineers pulled off roadmap to fix the un-evaluated system	$60,000	Engineering
Opportunity cost	Roadmap that did not ship; revenue and position lost to the delay	$120,000+	Revenue / the business
Trust and morale	Defensive review, slower decisions, churn from production failures	Hard to price, rarely zero	Everyone

The visible lines at the top sum to about $115,000. The hidden lines below them are where the number more than doubles, and the largest of those is the opportunity cost that no calculator will hand you. This is why I tell founders that the salary is the receipt, not the bill. The bill arrives later, in pieces, from departments that never approved the hire.

How to compute the cost of a bad AI hire for your team

You do not need a perfect figure; you need a defensible one. The formula is simple enough to run on a napkin, and running it once before a hire is the cheapest diligence you will ever do.

// Cost of a bad AI hire, illustrative formula wasted_comp = monthly_loaded_cost * months_on_team replacement = recruiting_fee + second_search_cost ramp_writeoff = onboarding_weeks * senior_weekly_cost cleanup = cleanup_weeks * senior_weekly_cost opportunity = delayed_revenue_or_roadmap_value total = wasted_comp + replacement + ramp_writeoff + cleanup + opportunity // morale and churn are real; add a deliberate buffer, do not set them to zero

Work a quick example. A senior at $180,000 base is roughly $20,000 a month fully loaded; five months on the team is $100,000 in comp before anyone ships. Add $25,000 to replace them, $30,000 in ramp you wrote off, and six senior-weeks of cleanup at about $10,000 a week, and you are at $215,000 before opportunity cost. Put any reasonable number on the quarter of roadmap that slipped, and the total clears a quarter of a million dollars for a single hire.

That math is the argument for spending more on vetting and seniority up front, not less. Every dollar you move from cleaning up a bad hire to preventing one buys you leverage, because prevention is cheap and the failure is expensive. For the other side of this ledger, what a good AI engineer actually costs to hire well, I wrote a full breakdown in the AI engineer cost guide.

How to avoid the cost of a bad AI hire: vet, hire senior, or partner

The cost of a bad AI hire is almost entirely preventable, and the prevention is not mysterious. It comes down to three moves, in order of how much risk they take off the table.

First, vet for judgment, not vocabulary. The engineer who can describe RAG is not the same as the one who will instrument it, evaluate it, and refuse to ship it without observability. Vet for the habits that prevent silent failure: do they build evals, do they log model outputs, do they distrust a clean demo. My guides on how to vet AI engineers and the red flags that predict a bad AI hire are the screening playbook I actually use.

Second, bias toward seniority on anything that touches production. The single most expensive mistake I see is hiring a junior into a role that needs judgment and hoping the title grows into the work. In AI, where failure is silent, the gap between senior and junior is not speed, it is whether the system fails safely; I made the full case in senior vs junior AI engineer.

Third, if you cannot absorb the hiring risk, do not absorb it. The reason a transparent-rate partner exists is to move the variance off your books: vetted senior engineers, a trial before you commit, and an outcome you can hold rather than a resume you have to bet on. That is the model behind Devlyn's AI application engineers, and it is the most direct way to make the cost of a bad AI hire someone else's problem to underwrite. The deeper framework for building a team that does not generate these costs is in my book, Building an AI-Native Team.

Frequently asked questions

What is the real cost of a bad AI hire?

The benchmark floor is 30% to 50% of annual salary, but for a senior AI engineer the all-in cost commonly runs $150,000 to $300,000 or more once you add wasted ramp, senior cleanup time, lost roadmap, and the production damage from an un-evaluated system. The salary is the smallest part of that total. The largest is usually the opportunity cost of the roadmap that did not ship while the wrong person held the seat.

Why is a bad AI hire more expensive than a bad hire in other roles?

Because AI fails silently. A weak engineer in most roles writes code that breaks loudly in CI, so you catch it before a customer does. A bad AI hire ships a system with no evals and no observability that demos clean and is confidently wrong in production, so the failure surfaces as churn and support tickets weeks later, when the cause is cold and the trust is already spent.

How do I calculate the cost of a bad hire for my own team?

Add wasted loaded comp for the months on the team, recruiting and second-search costs, the ramp you wrote off, senior cleanup time, and the value of the roadmap that slipped, then add a deliberate buffer for morale and churn rather than setting them to zero. Run that number before you hire, not after. A defensible estimate beats a precise one, and it is the cheapest diligence you will do.

How do I avoid the cost of a bad AI hire?

Vet for judgment rather than vocabulary, bias toward seniority on anything that touches production, and if you cannot absorb the hiring risk, use a trial-first partner so the variance lives on someone else's books. The habits that prevent silent failure, building evals and instrumenting outputs and distrusting a clean demo, are screenable, and screening for them up front is far cheaper than cleaning up after a hire who lacks them.

If the math in this piece describes a hire you are about to make, or one you are currently regretting, the lowest-risk path is to start with vetted senior engineers on a trial rather than betting a quarter of roadmap on a resume. That is what Devlyn's AI application engineers are built for. Price the bad hire honestly, then make sure you never pay it twice.

How AI Changed Software Hiring

Alpesh Nakrani — Thu, 19 Mar 2026 18:30:00 GMT

How AI changed software hiring comes down to one shift: it moved the thing you are actually hiring for. The job used to be throughput, how fast and how much code a person could produce; now the job is judgment, whether they can tell good output from plausible output once the machine has written it. Generation got cheap, so the scarce skill is no longer producing the artifact. It is evaluating it.

I have hired on both sides of this line. For most of my career as an engineer, I screened candidates the way everyone did: can this person write clean code, fast, under a little pressure. Now, running hiring at Devlyn, I screen for something the old interview was never designed to measure. The candidates who clear the bar are the ones who can look at what a model generated and tell me, precisely, what is wrong with it.

This is not a story about AI replacing engineers. It is a story about the bar moving, and most hiring processes not having moved with it. If you are still running the interview you ran in 2021, you are selecting for a skill the market has quietly stopped paying a premium for. If you would rather skip the rebuild and hire engineers already selected for judgment over throughput, that is what my team does at Devlyn.

The thing you hire for moved. Generation went from scarce to cheap, so the premium shifted from producing code to judging it. Throughput is now table stakes; judgment is the differentiator.
The junior squeeze is structural, not a verdict. The tasks entry roles were built around are exactly the tasks AI now does, and the hiring data shows it. That is a pipeline problem, not proof that juniors are worthless.
Interviews have to test evaluation, not just generation. Hand a candidate generated output with a subtle flaw and ask what is wrong with it. That single move surfaces more signal than a clean coding sprint.
AI is now on both sides of the table. Candidates use it on take-homes; recruiters use it to screen. The old proxies for skill leak, so you have to design around the leak.
The fundamentals did not change. Systems thinking, communication, accountability, and trust still decide who is worth hiring. AI raised the floor on production and the ceiling on judgment.

Generation got cheap, so judgment became the job

Here is the mechanism, stated plainly. For decades, the bottleneck in shipping software was production: code had to be written by a human, one line at a time, so you hired to that bottleneck. You hired people who could write more code, faster, with fewer bugs, and you built interviews to find them. The whiteboard, the timed coding challenge, the take-home that asked for a working feature: every one of those was a throughput test in disguise.

Then generation got cheap. A capable model now drafts a function, a test suite, a migration, a first pass at a feature in seconds. The Stack Overflow 2024 Developer Survey found that 76% of developers were using or planning to use AI tools, up from 70% the year before, with 62% already using them daily, nearly half again the prior year's 44% (survey.stackoverflow.co); by the 2025 survey that figure had climbed to 84% (stackoverflow.blog). When most of your engineers are generating with a machine, the volume of code stops being the constraint.

So the bottleneck moved. It moved from "can we produce this?" to "is this correct, and is it the right thing?" The machine fills the canvas now. The open question is whether anyone on your team can look at the result and know, with confidence, whether it is sound, whether it is secure, whether it will hold up at scale, and whether it solved the actual problem rather than a plausible-looking adjacent one.

That is judgment, and it is much harder to hire for than throughput. Throughput you can count: lines, commits, tickets closed, a feature that compiles by the end of the session. Judgment does not show up on a counter; it shows up in the questions a person asks before they write anything, and in their ability to spot the confident wrong answer that a less experienced engineer would have shipped. The whole interview has to be rebuilt around surfacing that, which is the practical heart of Building an AI-Native Team, where I lay out how to interview for judgment instead of speed, and it is the work my team does for hirers at Devlyn.

The machine fills the canvas now. The open question is whether anyone on your team can look at the result and know whether it is actually right.

The junior squeeze is real, and the data says so

The most visible part of the AI impact on software hiring is who gets hired at the bottom. The work that traditionally went to junior engineers, implementing a clearly specified ticket, writing boilerplate, fixing well-scoped bugs, is precisely the work that AI does competently now. So the demand for that work, as a thing you pay an early-career human to do, fell.

This is not a hunch. A Stanford study, "Canaries in the Coal Mine?" by Brynjolfsson, Chandar, and Chen, found a 13% decline in employment for workers aged 22 to 25 in AI-exposed roles like software development through mid-2025, while employment for workers 26 to 55 in the same occupations stayed stable or rose (LeadDev). Stack Overflow's reporting puts the drop in early-career developer employment closer to 20% since late 2022, and notes that 37% of employers said they would prefer AI over a recent graduate for some tasks (stackoverflow.blog). The pattern is consistent: the squeeze lands hardest where the work overlaps most with what a model can do.

I want to be careful here, because the doom framing gets this wrong. The data does not say junior engineers are worthless. It says the specific bundle of tasks we used to package into an entry-level job got automated, and we have not yet repackaged the role around what is left. That is a design failure on the employer side, not a character verdict on a generation of engineers.

It also creates a problem nobody has solved cleanly: if you stop hiring juniors because AI does junior work, you stop manufacturing seniors. Senior judgment is not downloaded; it is accumulated, by doing the work, making mistakes, and watching outcomes. A company that hires only seniors is, in effect, free-riding on apprenticeship that happened somewhere else, and that well runs dry. The honest answer is that the junior role has to be redesigned around judgment from day one, not deleted, but most teams have not done that work yet.

What employers now screen for: how AI changed software hiring criteria

If the job is judgment, the interview has to test judgment, and most interviews do not. Here is what I have moved toward, and what I see the better hiring teams converging on.

First, evaluation over generation. The single highest-signal move I have found is to hand a candidate a piece of AI-generated code or design that contains a subtle, real flaw, a security hole, a wrong assumption about scale, a misread requirement, and ask: what is wrong with this, and what would you need to know before you shipped it? Engineers with judgment dig in and find the problem. Engineers trained for throughput pivot immediately to rewriting it their way, which tells me they cannot evaluate, only produce.

Second, specification. Can this person take a vague problem and sharpen it into a precise spec before any code exists? In an AI-native workflow, the quality of the output is bounded by the quality of the spec that constrained it, so a person who writes fuzzy specs gets fuzzy generated work no matter how good the model is. I watch how candidates make sense of ambiguity: what they ask, what they assume, what they refuse to assume.

Third, ownership and taste. Will this person own an outcome end to end, and do they have a felt sense of what good looks like in this domain, the thing that lets them reject a technically-correct answer that is wrong for the user? These are harder to test, but you surface them by giving real, ambiguous work from your actual domain and watching how someone navigates it, which is closer to how I think about the difference between a senior and a junior AI engineer than years of experience ever was. The deeper list of what to look for lives in the AI engineer skills that actually separate the good ones.

AI moved into the hiring process itself

The other half of the story is that AI did not just change what you hire for. It changed the hiring process itself, on both sides of the table, and that broke a lot of the proxies hiring used to lean on.

On the candidate side, take-home assignments quietly stopped working. When a model can complete most take-homes in minutes, a polished submission tells you almost nothing about the person who submitted it. I have seen flawless take-homes from candidates who could not, in a live conversation, explain why their own code made the choices it made. The artifact is no longer evidence of the skill; the conversation about the artifact is.

On the employer side, AI screening tools now sit between applicants and humans, parsing resumes, ranking candidates, sometimes running first-round interviews. This cuts both ways. It lets a small team process a flood of applications, but it also optimizes for whatever the model rewards, which is often keyword-matching and confident phrasing rather than actual judgment. If you are not careful, you build a funnel that filters for people good at being parsed by AI, which is not the same as people good at the job.

My practical response has been to move signal back to live, unscripted interaction, and to assume the artifacts are AI-assisted. I no longer ask "did you write this?" because the honest answer is usually "the model and I wrote it together," and that is fine, that is the actual job. I ask "walk me through why it does this, and what would break it." That question cannot be faked by a tool, because it tests the judgment that sits behind the artifact, not the artifact itself.

The artifact is no longer evidence of the skill. The conversation about the artifact is.

What did not change

It would be easy to read all of this as "everything is different now," and that is not true. A surprising amount of what made a good hire a good hire is exactly what it was. AI raised the floor on production and the ceiling on judgment, but it did not touch the fundamentals.

Systems thinking still decides who can build something that survives contact with real traffic. Communication still decides who can align a team and write a spec that a human or a model can act on. Accountability, the willingness to own a bad outcome instead of pointing at the ticket, still separates the people you want from the people you tolerate. None of those got easier or cheaper; if anything they matter more, because the leverage per person went up, so a single person's judgment now moves more output.

Trust did not change either. When a candidate gives you a confident answer, you still have to decide whether to believe them, and the cost of believing the wrong person is higher now, not lower, because that person is steering a machine that produces fast. A plausible wrong hire used to produce a manageable amount of bad code; a plausible wrong hire with AI tooling produces a lot of bad code, quickly, that looks fine until it does not. The premium on getting the trust judgment right went up.

What it means for how you build a team

Put the pieces together and the shape of the team changes. The senior-to-junior ratio tilts senior, because one senior who can architect and evaluate now covers what used to take several producers beneath them. The flatter, judgment-dense structure I described in what a team is for after the machine does the work is the direct organizational consequence of the hiring shift in this article.

But tilting senior creates the apprenticeship problem I flagged earlier, and the teams thinking clearly about this are not deleting the junior role. They are redesigning it. Instead of hiring a junior to produce code, they hire a junior to learn judgment fast: pairing them with seniors on evaluation, putting them on real ambiguity early, and accepting that the first year is about building calibration rather than shipping volume. That is more expensive per head and slower to pay off, which is exactly why most teams skip it and then complain about the senior shortage they helped create.

The macro version of this is what I call the judgment economy: as execution commoditizes, value concentrates in the humans who can direct and evaluate, not the ones who produce. Hiring is where that abstraction becomes concrete. Every req you open in software hiring in the AI era is a small bet on whether you are buying throughput, which is now cheap, or judgment, which is not. The teams that internalize this early, and rebuild their interviews around it, will out-hire the ones still running the 2021 loop, which is the through-line of my AI-native thesis and the framework in building an AI team.

If you are working through what to screen for and how to structure the loop, that is the problem my team solves every day. We hire and place engineers selected for exactly this, the ability to judge AI output, not just generate it, at Devlyn.

Before and after: how AI changed software hiring, dimension by dimension

Here is how AI changed software hiring laid out in one view, the old hire against the new one.

Hiring dimension	Pre-AI	AI era
What you screen for	Production speed and code volume	Judgment: can they evaluate what the model produced
The take-home	Strong signal of skill	Weak signal; the model can do it. The walkthrough is the signal
Junior role	Implement specified tickets, build craft over time	Tasks automated; role must be redesigned around judgment
Senior-to-junior ratio	Pyramid: many producers, few architects	Tilts senior; one evaluator covers several old producer roles
Interview question	"Build this feature"	"What is wrong with this generated code, and what would break it"
Who is on the other side	A human reviewing a human's work	AI on both sides: candidate generates, recruiter screens
The scarce skill	Writing correct code fast	Knowing whether code is correct, fast

Read down the right column and the message is consistent. The pre-AI hire was optimized for a production constraint that has largely dissolved. The AI-era hire is optimized for the constraint that replaced it, which is confident, fast evaluation. This is the same pattern I drew at the organizational level in the definitive guide to hiring AI engineers, the pillar this article sits under, where the full hiring framework lives.

Frequently asked questions

How did AI change software hiring? AI moved what you hire for: because generation got cheap, the premium shifted from producing code fast to judging whether the code a model produced is correct and right. Throughput became table stakes, and judgment, the ability to evaluate, specify, and own an outcome, became the differentiator. The interview has to be rebuilt to test that, because the old throughput tests no longer separate strong candidates from weak ones.

Are companies still hiring junior software developers? Less, and the data is clear about it: a Stanford study found a 13% employment decline for workers aged 22 to 25 in AI-exposed roles through mid-2025, while older cohorts stayed stable. But the honest read is that the tasks the junior role was built around got automated, not that juniors have no value. The teams thinking clearly are redesigning the junior role around learning judgment early rather than deleting it, because if you never hire juniors you eventually stop producing seniors.

What do employers screen for in AI-era software hiring? Evaluation, specification, ownership, and taste, with communication, systems thinking, and accountability mattering as much as they ever did. The highest-signal interview move is to hand a candidate AI-generated work with a subtle flaw and ask what is wrong with it and what they would need to know before shipping it. People with judgment find the problem; people trained for throughput rewrite it their way without spotting the flaw.

Should I worry about candidates using AI on take-home assignments? Assume they will, and stop treating the artifact as evidence, because a model can complete most take-homes in minutes and a polished submission tells you little about the person. Move your signal to live, unscripted conversation about the work: ask candidates to walk you through why the code makes its choices and what would break it. That tests the judgment behind the artifact, which is the thing you are actually hiring for.