How to Vet AI Engineers: The Process That Predicts
How to vet AI engineers in a way that predicts on-the-job performance: the work-sample that mirrors real work, the judgment probe, references, and a paid trial.
How to vet AI engineers, in the order that actually predicts whether they will perform on the job: read their portfolio for evidence they shipped and measured something real, run one work-sample that mirrors the actual job instead of a coding test, probe how they knew their system worked and what they did when it was wrong, check references for patterns rather than praise, and end on a short paid trial that puts them next to real work. The resume comes last in weight, not first. Everything that feels efficient to over-weight, the credential, the leetcode round, the model trivia, is the part that has stopped predicting anything.
I vet AI engineers for a living. I run the loop, I hire them, I deploy them on revenue-bearing work, and then I find out within a few weeks whether my vet was any good. That feedback loop is brutal and it is honest, and it has rewritten how I screen. The first versions of my process selected for people who interviewed beautifully and then froze the first time an eval set disagreed with their gut. This piece is the process I run now, written from the hiring seat. If you are reading from the other chair, treat it as a map of what a careful evaluator is actually trying to learn about you.
- Resume signal is the weakest signal. Credentials, framework lists, and model trivia are all cheap to fake and easy to acquire in 2026. They are a filter for the obvious no, not a predictor of the yes.
- The work-sample must test judgment, not typing. With around 84% of developers now using or planning to use AI coding tools, raw code output no longer reflects raw capability. Give them a real eval or debug task, not a blank-file coding round.
- "How did you know it worked?" is the fastest predictive filter. An engineer who cannot describe what they measured, what broke, and where a human sat in the loop has not shipped production AI, regardless of how the demo looks.
- Structured beats unstructured by roughly 2x. Same task, same rubric, scored on a scorecard. Gut-feel interviews are about half as predictive of on-the-job performance.
- The paid trial is the only test that fully predicts. A few days of real-ish work next to your team tells you more than every interview round combined.
Start with the signal you can trust: portfolio and GitHub, read correctly
Before any interview, I read what the candidate has actually built, and I read it the way a skeptic reads a P&L. The point of the portfolio pass is not to be impressed. It is to find the one project where they owned an AI system in production and to see whether they talk about it like someone who lived with its failures or someone who wrote it up for a resume.
What I read for: evidence of an eval set, even a crude one. A README that admits what the system gets wrong. A commit history that shows them fixing a real failure mode rather than chasing a green demo. A writeup that names a metric and a number, not just "improved accuracy." Those are the fingerprints of someone who has shipped and been held accountable for the output.
What I ignore: GitHub star counts, the length of the framework list, and the name of the vector database. None of those predict whether the person can tell a correct model output from a confidently wrong one. A repo with three hundred stars and no eval harness is a marketing artifact. A quiet repo with a frozen test set and an honest failure log is a hiring signal. To vet AI engineers well, you have to learn to read past the polish to the part that shows judgment, which is the same thing the hiring AI engineers guide argues you are really buying.
The work-sample that mirrors the actual job, not a coding test
The single highest-validity thing you can do is give the candidate a task that looks like the work, then watch how they approach it. Work-sample tests have been near the top of the predictive-validity tables for decades, and they got more important the moment coding stopped being the bottleneck. The catch in 2026 is that the classic take-home no longer measures what it used to.
Here is the problem. The old take-home assumed the code a candidate produces reflects the candidate's capability. That assumption is dead. Around 84% of developers now use or plan to use AI tools in their workflow, per the most recent Stack Overflow Developer Survey, so a blank-file coding round mostly measures how well someone prompts a model you would have given them anyway. You are not learning what you think you are learning.
So I changed what the work-sample tests. Instead of "write this function," I give a task that is judgment-shaped and AI-resistant: here is a small RAG support bot that returns confident wrong answers about a third of the time, find out why and tell me what you would change. Or: here is an eval set and two model outputs, score them, defend your rubric, and tell me what the set is missing. The candidate can use any tool they want, because the thing I am scoring is not the typing. It is whether they reach for a measurement, whether they form a hypothesis before they touch code, and whether they can tell me what they are uncertain about.
One illustrative loop from my own files. A candidate with a thin resume and a state-school degree took the broken-RAG task, spent the first ten minutes building a tiny eval harness before changing anything, found the chunking bug, and then said the sentence that got him hired: "I would not ship this until I had thirty more labeled failures, this fix is a guess on a sample of one." He was right, and he was the strongest hire of that cohort. The credentialed candidate who skipped straight to a prompt tweak and declared it fixed was the one I would have over-weighted on paper.
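For readers who have not seen one, here is a minimal sketch of what a "tiny eval harness" of that kind might look like. It is not the candidate's actual code. The `toy_bot` function, the three labeled cases, the substring check, and the 0.8 confidence cutoff are all hypothetical stand-ins chosen for illustration; a real harness would call the bot under test and would likely need a better correctness check than substring matching.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Case:
    question: str
    expected: str  # substring a correct answer must contain (crude, assumed check)


def run_eval(
    bot: Callable[[str], Tuple[str, float]],
    cases: List[Case],
    confident_at: float = 0.8,  # assumed cutoff for "confident"
) -> dict:
    """Score a bot on a frozen labeled set. The bot returns (answer, confidence)."""
    if not cases:
        raise ValueError("eval set is empty")
    correct = 0
    confident_wrong = []
    for case in cases:
        answer, confidence = bot(case.question)
        if case.expected.lower() in answer.lower():
            correct += 1
        elif confidence >= confident_at:
            # Wrong and sure of itself: the failure mode the task is about.
            confident_wrong.append(case.question)
    n = len(cases)
    return {
        "n": n,
        "accuracy": correct / n,
        "confident_wrong_rate": len(confident_wrong) / n,
        "confident_wrong": confident_wrong,
    }


def toy_bot(question: str) -> Tuple[str, float]:
    """Hypothetical stand-in for the support bot under test."""
    canned = {
        "What is the refund window?": ("Refunds are accepted within 30 days.", 0.95),
        "Do you ship to Canada?": ("No, we only ship within the US.", 0.90),
        "How do I reset my password?": ("I'm not sure.", 0.20),
    }
    return canned.get(question, ("I don't know.", 0.0))


# Frozen labeled set: write it down once, then rerun it after every change.
CASES = [
    Case("What is the refund window?", "30 days"),
    Case("Do you ship to Canada?", "yes"),
    Case("How do I reset my password?", "reset link"),
]

report = run_eval(toy_bot, CASES)
print(report)
```

The point of the sketch is the order of operations: the labeled set and the measurement exist before any fix is attempted, so a prompt tweak or a chunking change can be judged against the same cases rather than against a sample of one.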
Keep the work-sample short, and pay for it if it runs past an hour. Score it on a rubric you wrote before the session, because a structured work-sample scored the same way for every candidate is roughly twice as predictive as the same hour run on instinct. The interview research points the same way: Schmidt and Hunter's long-standing meta-analysis put structured interviews at around 0.51 validity against 0.38 for unstructured, and more recent work from Sackett and colleagues revised both figures downward but widened the gap, to roughly 0.42 versus 0.19. The discipline of sameness is not bureaucracy. It is the part that makes the comparison mean anything.
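A quick arithmetic check on the "roughly twice" figure, using only the validity coefficients quoted above. The dictionary keys are just labels for the two sources, and the `ratio` helper is a hypothetical name for this example.

```python
# Validity coefficients as quoted in the text: (structured, unstructured).
VALIDITIES = {
    "Schmidt and Hunter": (0.51, 0.38),
    "Sackett and colleagues": (0.42, 0.19),
}


def ratio(structured: float, unstructured: float) -> float:
    """How many times more predictive the structured format is."""
    return structured / unstructured


for source, (s, u) in VALIDITIES.items():
    print(f"{source}: {ratio(s, u):.2f}x")
# Schmidt and Hunter: 1.34x
# Sackett and colleagues: 2.21x
```

So the 2x claim rests on the more recent estimates; on the older figures the advantage is closer to a third.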
The fastest way to assess AI engineers: how they knew it worked
If I could keep only one question, it would be this: how did you know it worked? Then I follow the thread wherever it goes. What did you measure? On what set? What did the set miss? When the model was confidently wrong in front of a user, what did you do in the next hour, and what did you change so it would not happen again?
This is the question that separates engineers who shipped from engineers who demoed. Anyone can describe a RAG architecture in 2026; the docs are everywhere and the model will recite them for you. What cannot be faked is the texture of having owned a system that failed in front of real people. The engineer who has lived it answers with a metric, a failure mode, and a regret. The engineer who has not answers with an architecture diagram. The eval discipline I am probing for is the same muscle the whole field of LLM evaluation is built on, and it is the clearest dividing line between the two kinds of candidate.
I also probe for how they handle being wrong inside the interview itself. I will push back on a correct answer to see whether they fold or hold their ground with evidence, and I will hand them a genuinely ambiguous call to see whether they say "I do not know yet, here is how I would find out" or bluff. The bluff is disqualifying for production AI work, because catching a model that is confidently wrong is the whole job, and an engineer who is confidently wrong about the model is a force multiplier for the failure. The questions that do this well are the ones I collected in the piece on AI engineer interview questions; the short version is that you are testing for calibrated honesty, not recall.
Reference checks done right, including the backchannel
Most reference checks are theater. You call the three names the candidate handed you, you ask whether they would work with the person again, the reference says yes, and you have learned nothing. Done right, references are one of the highest-value parts of the process, and the trick is to ask for patterns instead of praise.
I ask references for the texture a resume hides. Tell me about a time the candidate was wrong and how they handled it. What would their last team say they need to work on? When something broke in production, were they the one who reached for the logs or the one who reached for an excuse? Then I shut up and let the pauses do the work, because the hesitation before a polite answer often says more than the answer.
The backchannel is where the real signal lives, and it has to be handled ethically. A backchannel reference is someone who worked with the candidate but is not on their list, reached through your own network. The hard rule is that you never contact anyone at the candidate's current employer, because exposing that someone is interviewing can cost them their job. Done within a trusted network and weighed only on patterns rather than one bitter anecdote, a backchannel will confirm your excitement or surface the blind spot the curated references were chosen to hide.
The paid trial: the only test that fully predicts
Every method above is a proxy. The paid trial is the real thing. A few days to two weeks of real-ish work, paid at a fair rate, next to your team, tells you what no interview can: how they communicate mid-task, how they handle a vague spec, whether their first instinct under real pressure is to measure or to guess, and whether the people around them want more of it.
I scope the trial around an actual ticket or a sanitized version of one, never a contrived puzzle. I watch three things specifically: do they ask the clarifying question before they build, do they leave the codebase more measurable than they found it, and do they tell me when they are stuck instead of going dark for two days. A second illustrative case from my files: a candidate who aced every interview round went quiet for three days during the trial, then surfaced a half-finished branch with no tests and no questions asked. The interviews said yes. The trial said no, and the trial was right. We had spent maybe a few hundred dollars to avoid a hire that would have cost months.
The honest trade-off is that strong candidates have options and will not all do a trial, and a genuinely mission-critical full-time seat sometimes warrants going straight to an offer on the strength of the work-sample and references. But for contractors, for fractional work, and for any hire where you can afford the time, the paid trial is the cheapest insurance you will ever buy against a six-figure mistake. If you would rather not run this gauntlet yourself, the engineers we place through Devlyn's AI application engineer hiring have already cleared a version of it on live production work, so the trial is effectively done before you meet them.
What NOT to over-weight
The mirror image of a good vet is knowing what to discount, because most broken hiring processes are not missing a step. They are over-weighting the wrong ones. Here is what I have learned to discount, and why each one feels predictive but is not.
Credentials and pedigree. A degree or a brand-name former employer tells you the person cleared someone else's bar years ago, not that they can tell a correct model output from a confidently wrong one today. I have hired state-school and self-taught engineers who ran circles around PhDs on production judgment, and I have seen the reverse. Neither credential predicted the outcome.
Leetcode and algorithm puzzles. They measure a narrow, coachable skill that has almost no overlap with the actual job of shipping reliable AI features. Model trivia is the same trap one layer up: reciting attention mechanisms or naming the newest model is recall, and recall is the cheapest thing on the market now. The AI engineer skills that actually matter are judgment skills, and none of them show up on a whiteboard.
Demo polish and raw take-home output. A beautiful demo selects for presentation, and a clean take-home now selects for prompt quality, not capability. Both look like competence and predict almost nothing about how the person behaves when an eval set disagrees with them at 11pm before a launch.
A scorecard to vet AI engineers you can run
Here is the whole process as one scorecard: each signal, the weight I give it, and how I actually test it. The weights are mine and they are deliberately tilted toward the things that have predicted on-the-job performance in my own hiring, not toward the things that are easy to measure.
| Signal | Weight | How to test it |
|---|---|---|
| Eval and judgment ("how did you know it worked") | High | Live probe: metric, set, failure mode, what they changed |
| Work-sample on a real, AI-resistant task | High | Debug a broken RAG bot or score an eval set; rubric-scored |
| Paid trial on real-ish work | High | 3-10 days, real ticket, watch communication and measurement |
| References, read for patterns | Medium | Pattern questions plus an ethical backchannel |
| Portfolio / GitHub depth | Medium | Read for eval sets and honest READMEs, not stars |
| Handles being wrong (calibration) | Medium | Push back in-interview; reward "I do not know yet, here is how I would find out" |
| Credentials and pedigree | Low | Filter for the obvious no only; never a tiebreaker |
| Leetcode / model trivia | Very low | Skip; it measures recall, not production judgment |
Run the high-weight rows on every candidate in the same order with the same rubric, and you have a structured process that is roughly twice as predictive as the interview most teams actually run. The discipline is dull and the payoff is enormous, which is the usual shape of things that work. The full hiring loop this scorecard sits inside, including what good costs and how the bad hires fail, lives in the hiring AI engineers guide, and the deeper org playbook is in my book The AI-Native Team.
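If you want to run the scorecard mechanically, the sketch below is one way to do it. The table only gives qualitative weights, so the numeric mapping (High = 3, Medium = 2, Low = 1, and 0 for the row the table says to skip), the 1 to 5 rating scale, and both sets of candidate ratings are assumptions made up for this example, not calibrated values.

```python
from typing import Dict

# Assumed numeric stand-ins for the High / Medium / Low labels in the table.
WEIGHTS: Dict[str, float] = {
    "eval_and_judgment": 3.0,
    "work_sample": 3.0,
    "paid_trial": 3.0,
    "references": 2.0,
    "portfolio": 2.0,
    "calibration": 2.0,
    "credentials": 1.0,
    "leetcode_trivia": 0.0,  # the table says skip it, so it contributes nothing
}


def weighted_score(ratings: Dict[str, int]) -> float:
    """Weighted average of 1-5 rubric ratings, on the same 1-5 scale."""
    scored = [s for s, w in WEIGHTS.items() if w > 0]
    missing = [s for s in scored if s not in ratings]
    if missing:
        # Same rubric for every candidate: no silently skipped rows.
        raise ValueError(f"unscored signals: {missing}")
    for s in scored:
        if not 1 <= ratings[s] <= 5:
            raise ValueError(f"rating for {s} must be between 1 and 5")
    total = sum(WEIGHTS[s] * ratings[s] for s in scored)
    return total / sum(WEIGHTS[s] for s in scored)


# Hypothetical ratings: thin resume with strong judgment vs. the reverse.
candidate_a = {
    "eval_and_judgment": 5, "work_sample": 5, "paid_trial": 4,
    "references": 4, "portfolio": 3, "calibration": 5,
    "credentials": 2, "leetcode_trivia": 2,
}
candidate_b = {
    "eval_and_judgment": 2, "work_sample": 2, "paid_trial": 2,
    "references": 4, "portfolio": 4, "calibration": 2,
    "credentials": 5, "leetcode_trivia": 5,
}

print(f"A: {weighted_score(candidate_a):.2f}")  # A: 4.25
print(f"B: {weighted_score(candidate_b):.2f}")  # B: 2.69
```

Refusing to score a candidate with a missing row is the design choice that matters here: it forces every candidate through the same signals, which is what makes the totals comparable. The number itself should start a hiring discussion, not replace one.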
Frequently asked questions
How do you vet an AI engineer who has never shipped to production?
Substitute scope for production scars. Give the judgment-shaped work-sample, the broken-RAG or eval-scoring task, and weight how they reason about measurement and uncertainty rather than what they have shipped. A junior who builds a tiny eval harness before changing code is showing you the exact muscle you are buying. Just price the role and the risk accordingly, and lean harder on the paid trial.
Is a take-home test still worth giving in 2026?
Only if it tests judgment instead of typing. A blank-file coding round mostly measures how well someone prompts an AI tool, since around 84% of developers now use or plan to use them. A take-home that asks the candidate to debug a confidently wrong system or critique an eval set still works, because those are hard to fake with a model and they reveal how the person thinks.
How long should the paid trial be?
Long enough to see real behavior, short enough to be cheap insurance: three to ten days on a real or sanitized ticket is the usual range. You are watching for clarifying questions, measurement instinct, and honest communication when stuck. For a mission-critical full-time seat you can sometimes skip it on the strength of a strong work-sample and references, but for contract and fractional work it is the highest-value step you have.
What is the single fastest way to assess AI engineers?
Ask "how did you know it worked," then follow the thread. An engineer who answers with a metric, a set, and a failure mode has shipped and owned production AI. One who answers with an architecture diagram has not. It is not a complete vet on its own, but no other single question filters faster.
If you want the full hiring loop this fits into, the hiring AI engineers guide covers sourcing, cost, and failure modes, and The AI-Native Team goes deeper on the org around the hire. And if you would rather skip the whole gauntlet, the engineers placed through Devlyn's AI application engineer hiring have already cleared a harder version of this process on live work. Vet for judgment. Discount the rest.
