'A human reviews it' is not a plan

Putting a person in the loop feels safe and scales terribly. The reviewer becomes a bottleneck, then a rubber stamp, then a liability.

Every AI rollout I have seen in the last two years has the same slide. It shows a flowchart with boxes for "model generates output," "human reviews output," and then a green arrow labeled "approved." The presenter clicks past it in under ten seconds. Nobody asks questions. It feels self-evidently responsible. Of course a human reviews it. What kind of reckless organization would skip that step?

The problem is that "a human reviews it" is a sentence fragment pretending to be a policy. It answers none of the questions that actually determine whether oversight does anything. Which human? Reviewing what, exactly: the raw output, the downstream effect, or the user-facing rendering? Against what rubric? With what authority to reject, revise, or escalate? At what sampling rate? Within what latency window? What happens when that person is out sick, on a call, or has 400 items in their queue?

I have been thinking about this ever since we started running agentic workflows at Devlyn at real volume (thousands of interactions a day, not hundreds). "Human in the loop" as a phrase does not survive contact with a production system at that scale. What survives is a design. And most organizations have not designed anything. They have named a person and called it a process.

The queue collapse you do not see coming

Here is how it usually goes. The pilot looks great. Volume is low, the reviewer is a subject matter expert who is also deeply motivated because this is new and interesting. Approval latency is two hours. Quality is excellent. Leadership signs off on expansion.

Volume doubles. Then doubles again. The reviewer's queue grows from 20 items a day to 80 to 300. They start skimming. They stop reading the full output and start reading the first paragraph and the final sentence. They begin trusting the model more than they should, not because they are lazy but because the model has been right 97 times in a row and the alternative is staying at the office until 9 p.m. every night. The approval rate for genuinely bad outputs climbs from near zero to somewhere uncomfortable, but nobody notices because nobody is measuring it.

This is not a story about bad people. It is a story about a system that was never designed for the load it was asked to carry. The reviewer was a single point of failure, and the failure mode was invisible: not a crash, not an error log, just a slow drift toward rubber-stamping. The human in the loop became a human downstream of the loop, signing off on whatever the model decided.

The failure mode is not a crash. It is a slow drift toward rubber-stamping, a human who is technically reviewing and functionally not.

I have seen this happen in three different companies in the last eighteen months, in domains ranging from customer-facing copy to clinical documentation summaries to financial recommendations. The dynamics are identical. The timeline varies by volume, but the end state is the same: a reviewer who is now a liability, not a safeguard, because decisions made under their nominal oversight carry the institutional weight of human approval without the substance of human judgment.

The incomplete sentence problem

When I say "human in the loop is an incomplete sentence," I mean it structurally. A proper oversight design requires answers to at least six questions before it can be evaluated as a plan.

Who reviews? Not a job title but a specific person or rotation, with defined backup coverage. If the answer is "whoever is available," the answer is nobody.

What do they review? The raw model output, the post-processing result, the user-visible artifact, or the logged metadata? Each has different failure modes. Reviewing only the output and not the downstream rendering has caused some of the most embarrassing AI incidents I know of.

Against what rubric? An expert reviewing without a rubric is not evaluating; they are pattern-matching to intuition. Intuition is valuable, but it cannot be transferred, calibrated, or measured. If two reviewers cannot agree on what "good" looks like for the same output, the process is producing noise.

With what authority? Can the reviewer reject? Modify? Escalate? If rejection means the item goes back into the queue and gets re-reviewed by the same person with no additional signal, rejection is a gesture, not a control.

At what sampling rate? 100% review is a specific operational bet that review adds more value than it costs. That is sometimes true and often not. Most organizations default to 100% because it sounds rigorous, not because they have done the math.

At what latency? A review process with a 72-hour turnaround in a customer-facing context is not oversight; it is a post-hoc audit with extra steps. The action has already happened. The damage, if any, has already propagated.

I wrote more about this framing in "Human in the Loop Is Not a Plan." The core argument is that oversight is only meaningful when it is specified precisely enough to be falsifiable. If you cannot describe what a failure of your oversight process looks like, you have not designed an oversight process.
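
One way to make the six questions concrete is to write them down as a structure that can be checked for gaps. The sketch below is illustrative only: the class name `OversightSpec`, its field names, and the example values are all hypothetical, not something from an existing tool.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class OversightSpec:
    """Hypothetical spec: one field per question. None (or empty) means unanswered."""
    reviewers: Optional[list[str]] = None      # who: named people or a rotation
    artifact: Optional[str] = None             # what: e.g. "raw_output", "rendered_artifact"
    rubric_id: Optional[str] = None            # against what rubric
    authority: Optional[set[str]] = None       # e.g. {"reject", "modify", "escalate"}
    sampling_rate: Optional[float] = None      # fraction of outputs reviewed
    max_latency_hours: Optional[float] = None  # review must finish within this window

    def gaps(self) -> list[str]:
        """Return the questions this spec still leaves unanswered."""
        missing = [f.name for f in fields(self) if not getattr(self, f.name)]
        # A single named reviewer answers "who" but has no backup coverage.
        if self.reviewers is not None and len(self.reviewers) == 1:
            missing.append("reviewers (no backup coverage)")
        return missing


# "A human reviews it": one name, nothing else specified.
slide_version = OversightSpec(reviewers=["alex"])
print(slide_version.gaps())
```

A spec with an empty `gaps()` list is not automatically a good plan, but a spec with a non-empty one is not yet a plan at all.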

Risk-tiered review and the capacity math

The practical alternative to "a human reviews everything" is not "nothing gets reviewed." It is a tiered model based on output risk and model confidence, with explicit capacity math for each tier.

At Devlyn, we think about this in three buckets. High-stakes outputs (anything that affects a patient's treatment path, a prescription, or a financial commitment) get 100% human review with a defined rubric, a credentialed reviewer, and a documented decision trail. Mid-stakes outputs (scheduling changes, care-plan summaries, patient communications) get sampled review at a rate calibrated to our current confidence in the model's error rate. Low-stakes outputs (administrative confirmations, appointment reminders, internal workflow nudges) go out with model confidence scoring and exception flagging only.

The capacity math looks roughly like this:

Tier 1 (100% review): 200 outputs/day × 4 min/review = 800 min/day = 2 FTE dedicated reviewers
Tier 2 (10% sample): 2,000 outputs/day × 10% × 3 min/review = 600 min/day = 1.5 FTE
Tier 3 (exception only): 8,000 outputs/day × 0.5% exception rate × 5 min/review = 200 min/day = 0.5 FTE

Total review load at 10,200 outputs/day: ~1,600 min/day, manageable with 4 reviewers at about 400 review minutes each per day
If all 10,200 went to Tier 1: 40,800 min/day, you would need roughly 100 reviewers
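
The same arithmetic is easy to keep in a script so it gets rerun whenever volume changes. This is a sketch under stated assumptions: the volumes, review fractions, and minutes per review are the tier figures above, while the 400 productive review minutes per reviewer per day is my assumption (it is the figure implied by treating 800 minutes as 2 FTE). All names are hypothetical.

```python
import math

# Assumption: about 400 minutes of focused review per reviewer per day.
MINUTES_PER_REVIEWER_DAY = 400

# (outputs per day, fraction reviewed, minutes per review), from the tiers above.
TIERS = {
    "tier1_full_review": (200, 1.0, 4),
    "tier2_sampled": (2_000, 0.10, 3),
    "tier3_exceptions": (8_000, 0.005, 5),
}


def review_minutes(outputs: int, fraction: float, minutes_each: float) -> int:
    """Daily review minutes for one tier, rounded to whole minutes."""
    return round(outputs * fraction * minutes_each)


def reviewers_needed(minutes: int) -> int:
    """Whole reviewers required to absorb a daily review load."""
    return math.ceil(minutes / MINUTES_PER_REVIEWER_DAY)


per_tier = {name: review_minutes(*tier) for name, tier in TIERS.items()}
total = sum(per_tier.values())

# Counterfactual: every output gets the Tier 1 treatment (100% review, 4 min each).
all_outputs = sum(tier[0] for tier in TIERS.values())
all_tier1 = review_minutes(all_outputs, 1.0, 4)

print(per_tier, total, reviewers_needed(total))
print(all_tier1, reviewers_needed(all_tier1))
```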

The math is not the interesting part. The interesting part is that most organizations never do it. They start with "a human reviews it," watch the queue back up, quietly reduce the review burden by skimming or sampling without ever formally changing the policy, and end up with the worst of both worlds: the liability of claiming 100% review with the actual coverage of something much lower.

The formal tiering forces you to make the tradeoff explicit. You are stating, for the record, that you believe low-stakes outputs with high model confidence do not warrant line-by-line human review, and here is the evidence that belief is calibrated. That is a defensible position. "We reviewed everything" that is actually "one exhausted person approved everything in their queue" is not.

Rubrics, calibration, and the disagreement that exposes the gap

Even when you have the right reviewers, the right tier, and the right capacity, you can still have a non-functional oversight process if reviewers are not calibrated to each other.

Calibration is uncomfortable to talk about because it implies that expert judgment can be wrong, and most organizations would rather not surface that. But inter-rater reliability is a basic measurement. If you give the same 20 outputs to two experienced reviewers and they disagree on 8 of them, you have a rubric problem, not a personnel problem. The rubric, or its absence, is producing inconsistent results, which means the "human review" you are counting on is measuring something different person to person.

The right response to calibration failures is not to pick the reviewer whose judgments you prefer. It is to go back to the rubric, find the specific dimensions where disagreement is highest, make the criteria more explicit, and re-run the calibration. This is tedious. It takes longer than just approving things. But it is the only way to turn reviewer judgment into something that can be sampled, audited, and improved over time.
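
Raw agreement overstates reliability, because two reviewers will agree some of the time by chance. A common correction is Cohen's kappa, which is my choice of measure here rather than something prescribed above. The sketch below uses made-up pass/fail labels for 20 outputs on which two reviewers disagree 8 times.

```python
from collections import Counter


def cohens_kappa(ratings_a: list[str], ratings_b: list[str]) -> float:
    """Chance-corrected agreement between two reviewers over the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two equal-length, non-empty rating lists")
    n = len(ratings_a)
    # Fraction of items where the two reviewers gave the same label.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected if each reviewer labeled at random with their own label frequencies.
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both reviewers used one identical label for everything
    return (observed - expected) / (1 - expected)


# Hypothetical calibration round: 20 outputs, 12 agreements, 8 disagreements.
reviewer_a = ["pass"] * 10 + ["fail"] * 10
reviewer_b = ["pass"] * 6 + ["fail"] * 4 + ["fail"] * 6 + ["pass"] * 4

print(round(cohens_kappa(reviewer_a, reviewer_b), 2))
```

With these labels the reviewers agree on 60% of items, but half of that agreement is what chance alone would produce, so kappa comes out near 0.2. Computing it per rubric dimension, rather than once overall, is what points to the criteria that need rewriting.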

The framework I have found most useful, from "A Field Guide to Evals," separates evaluation into three layers: outcome metrics (did the thing we wanted to happen actually happen?), output quality metrics (was the model's output correct by the rubric?), and process metrics (did the review happen on time, at the right sampling rate, with documented rationale?). Most human-review processes measure only the third layer (was the form filled out?) and assume that implies the first two.

Sampling plus evals as the actual plan

At production volume, the sustainable alternative to 100% human review is a combination of automated evals and structured sampling. This is not a novel idea; it is how quality control works in manufacturing, clinical trials, software QA, and financial auditing. You do not manually inspect every widget. You define acceptance criteria, sample at a statistically meaningful rate, measure defect rates, and set thresholds that trigger escalation.

Applied to AI outputs, this means:

Automated evals run on 100% of outputs. These are not LLM-as-judge opinion scores (though those have their place in calibration). They are deterministic checks: does the output contain required fields? Does it fall within acceptable ranges? Does it contradict a known fact in the patient's record? Does it reference something the model should not have access to? These checks catch the obvious failures at zero marginal cost per additional output.

Structured human sampling runs on a rate calibrated to your error tolerance. If your automated evals are catching 90% of failure modes, and your manual sampling over 500 outputs shows a 1.2% undetected error rate, you know your current eval coverage and can make an informed decision about whether 1.2% is acceptable given output stakes. That is a risk management decision, not a philosophical one.

Exception escalation pulls specific outputs into mandatory human review. Outputs that trip an automated eval, fall below the model confidence threshold, or touch a high-risk category go to a human with a defined SLA and a rubric. The human is reviewing because something specific triggered escalation, not because the queue happened to route to them.
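
The three mechanisms above fit together as a single routing decision per output. The following is a minimal sketch, not a description of any production system: the `Output` shape, the required fields, the risk categories, and both thresholds are hypothetical placeholders.

```python
import random
from dataclasses import dataclass


@dataclass
class Output:
    payload: dict      # structured content produced by the model
    confidence: float  # model confidence score in [0, 1]
    category: str      # e.g. "prescription", "appointment_reminder"


# All four values below are hypothetical placeholders.
REQUIRED_FIELDS = {"patient_id", "summary"}
HIGH_RISK_CATEGORIES = {"prescription"}
CONFIDENCE_FLOOR = 0.80
SAMPLE_RATE = 0.10


def failed_checks(output: Output) -> list[str]:
    """Deterministic checks that run on every output."""
    failures = []
    missing = REQUIRED_FIELDS - set(output.payload)
    if missing:
        failures.append(f"missing fields: {sorted(missing)}")
    return failures


def route(output: Output, rng=random) -> tuple[str, list[str]]:
    """Return (destination, reasons): 'escalate', 'sample', or 'auto'."""
    reasons = failed_checks(output)
    if output.confidence < CONFIDENCE_FLOOR:
        reasons.append("low confidence")
    if output.category in HIGH_RISK_CATEGORIES:
        reasons.append("high-risk category")
    if reasons:
        # Mandatory review, and the reviewer is told why the item is in front of them.
        return "escalate", reasons
    if rng.random() < SAMPLE_RATE:
        return "sample", ["random sample"]
    return "auto", []


reminder = Output({"patient_id": "p1", "summary": "Visit on Friday"}, 0.95, "appointment_reminder")
print(route(reminder))
```

Returning the reasons alongside the destination is the design choice that matters: an escalated item arrives with the specific trigger attached, so the reviewer is not guessing what they are supposed to catch.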

Oversight is a system with defined inputs, outputs, and failure modes, not a person whose title implies responsibility.

This architecture scales. When volume doubles, your eval infrastructure scales horizontally. Your human reviewers, now working on a sampled and exception-driven basis, maintain consistent quality because they are not drowning. The sampling rate and exception thresholds become the levers you tune as the model improves or as you discover new failure modes.
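
Tuning the sampling rate depends on how much a sample actually tells you. A 1.2% undetected error rate measured on 500 outputs is 6 errors, and a number that small carries real uncertainty. The sketch below puts a range around it using the Wilson score interval, a standard choice that I am supplying here; it assumes the reviewed outputs were a simple random sample.

```python
import math


def wilson_interval(errors: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% interval (for z=1.96) on the true error rate."""
    if n <= 0 or not 0 <= errors <= n:
        raise ValueError("need n > 0 and 0 <= errors <= n")
    p = errors / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


# 6 undetected errors found in a manual sample of 500 outputs (1.2%).
low, high = wilson_interval(6, 500)
print(f"{low:.4f} to {high:.4f}")
```

For 6 errors in 500 the interval runs from roughly half a percent to about 2.6%. If 2.6% would be unacceptable for the tier in question, the honest conclusion is that the sample is too small to justify the current rate, not that 1.2% is fine.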

I am not going to lay out the mechanics of building that eval and sampling layer here. That is its own discipline, and I wrote a separate, more procedural piece on how to actually run human-in-the-loop evaluation. This essay is about why the one-line version is not a plan in the first place.

Autonomy boundaries that scale with evaluation strength

There is a principle that I think is underappreciated in conversations about human oversight: the appropriate level of AI autonomy should scale with the strength of your evaluation system, not with the novelty of the technology or the comfort level of leadership.

This sounds obvious but runs counter to how most rollouts are actually paced. Organizations typically start with high autonomy in the pilot (when everything is new and being watched closely), then add human review as they scale (when volume makes close watching impossible), then quietly reduce human review as the queue collapses (when they have the most volume and the least oversight). This is exactly backwards.

The right sequence is: start with 100% human review while you build your eval suite. As your evals gain coverage and you can measure model performance reliably, expand the model's autonomy in proportion to your measurement confidence. Never expand autonomy beyond the reach of your evaluation system. If you cannot measure it, you cannot govern it, and you certainly cannot catch it when it fails.

At Devlyn, we use the phrase "senior engineers own production readiness" to mean something specific about this. An engineer who ships an agentic feature does not hand it off to a reviewer and walk away. They own the eval coverage for that feature. They define what a failure looks like, instrument the detection, and set the thresholds. The "human in the loop" for their feature is a designed system they built and are accountable for, not a colleague they handed the queue to. When the eval coverage is strong and the error rate is low, they earn the right to reduce human review. When something unexpected shows up in the metrics, they increase it. Autonomy is a function of evidence, not of time elapsed since launch.
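
"Autonomy is a function of evidence" can be written down as an explicit rule, which at least makes the policy arguable. What follows is a deliberately simple sketch, not our actual policy: the function name, the halving step, the 2% floor, and the minimum sample size are all hypothetical parameters.

```python
def next_review_rate(
    current_rate: float,
    observed_errors: int,
    sampled: int,
    tolerance: float,
    step: float = 0.5,      # hypothetical: halve the review rate when evidence is good
    floor: float = 0.02,    # hypothetical: never sample less than 2%
    min_sample: int = 200,  # hypothetical: below this, there is not enough evidence to act on
) -> float:
    """Return the human review rate to use for the next period."""
    if sampled < min_sample:
        # Too little evidence: autonomy does not expand just because time has passed.
        return current_rate
    error_rate = observed_errors / sampled
    if error_rate > tolerance:
        # Evidence of trouble: contract autonomy back to full review.
        return 1.0
    # Evidence is good: reduce review, but never below the floor.
    return max(floor, current_rate * step)


print(next_review_rate(1.0, observed_errors=1, sampled=500, tolerance=0.01))
```

The rule compares a point estimate against the tolerance; a stricter version would compare the upper end of a confidence interval instead, so that a small lucky sample cannot buy extra autonomy.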

This does not mean every engineer needs to be an eval expert. It means that production readiness for AI features includes the oversight design as a first-class artifact: specified, reviewed, and revisited on a defined cadence.

Designing oversight as a system

The reframe I keep coming back to is this: oversight is infrastructure, not reassurance. "A human reviews it" is reassurance. It sounds responsible. It satisfies the compliance checkbox. It makes the slide deck feel complete. But it does not function as infrastructure because it has no defined inputs, no acceptance criteria, no failure modes, no scalability model, and no measurement framework.

Oversight infrastructure asks and answers: what is the expected error rate of this model in this context? What is the acceptable error rate given the stakes of each output tier? What automated checks catch which failure categories? At what sampling rate does human review add marginal value over the automated baseline? What does the escalation path look like, who is on it, and what SLA are they held to? What happens when the error rate crosses a threshold: does autonomy contract automatically, or does someone have to make a call?

These questions are not especially hard to answer. They are just boring to answer. They require the kind of careful, operational thinking that does not make the slide deck feel exciting. Nobody has ever gotten a standing ovation for presenting their inter-rater reliability calibration protocol. But this is exactly the work that separates organizations that have genuinely safe AI deployments from organizations that have AI deployments with a human somewhere nearby who can be blamed if something goes wrong.

The reviewer who becomes a rubber stamp is not the problem. The rubber stamp was the design. What you built was a system where volume would eventually exceed capacity, where the rubric was implicit and not transferable, where the reviewer had no way to know what they were supposed to be catching, and where nothing measured whether catching was actually happening. The human was a prop in a process that was never specified well enough to function.

If your current AI oversight plan would stop functioning correctly if that one reviewer took a two-week vacation, it is not a plan. It is a dependency. Plans survive the individuals inside them. Oversight systems should too.

Start with the questions. Write down the answers. Make the rubric. Do the capacity math. Build the evals. Set the thresholds. Then put a human in the loop, but only in the specific, instrumented, scalable way that actually means something.

If you are trying to build oversight that survives production volume rather than a reviewer who quietly drowns in it, this is the kind of work my team does at Devlyn.

Frequently asked questions

What does "human in the loop" actually mean for AI systems? In practice it should mean a designed oversight system: a specific reviewer or rotation, a defined rubric, a sampling rate, an authority to reject or escalate, and a latency window. Most of the time it means none of that, just a person named on a slide. If you cannot describe what a failure of your oversight process looks like, you do not have a human in the loop; you have a dependency.

Why does human-in-the-loop review break down at scale? Volume grows faster than reviewer capacity. The reviewer starts skimming, then trusts the model because it has been right many times in a row, and the process drifts toward rubber-stamping without anyone measuring it. The failure is invisible: no crash, no error log, just a human who is technically reviewing and functionally not.

What is the alternative to reviewing every AI output by hand? Risk-tiered review plus automated evals and structured sampling. Run deterministic checks on 100% of outputs, sample human review at a rate calibrated to your error tolerance, and route exceptions and high-stakes cases to a reviewer with a rubric and an SLA. Then scale model autonomy in proportion to how well you can measure it, never beyond the reach of your evaluation system.